Abstract. The aim of this paper is to generalize the PAC-Bayesian theorems
proved by Catoni [6, 8] in the classification setting to more general problems
of statistical inference. We show how to control the deviations of the risk of
randomized estimators. Particular attention is paid to randomized estimators
drawn in a small neighborhood of classical estimators, whose study leads to a
control of the risk of the latter. These results make it possible to bound the
risk of very general estimation procedures, as well as to perform model selection.

1. Introduction

The aim of this paper is to perform statistical inference with observations in a
possibly high-dimensional space. Let us first introduce some notation.

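1.1. General notations. Let $N \in \mathbb{N}^*$ be the number of observations. Let
$(\mathcal{Z}, \mathcal{B})$ be a measurable space and $P_1, \ldots, P_N$ be $N$
probability measures on this space, unknown to the statistician. We assume that

\[ (Z_1, \ldots, Z_N) \]

is the canonical process on

\[ \left(\mathcal{Z}^N, \mathcal{B}^{\otimes N}, P_1 \otimes \cdots \otimes P_N\right). \]

Definition 1.1. Let us put

\[ \mathbb{P} = P_1 \otimes \cdots \otimes P_N, \]

and

\[ \overline{\mathbb{P}} = \frac{1}{N} \sum_{i=1}^{N} \delta_{Z_i}. \]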

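We want to perform statistical inference on a general parameter space $\Theta$, with
respect to some loss function

\[ \ell_\theta : \mathcal{Z} \to \mathbb{R}, \qquad \theta \in \Theta. \]

Definition 1.2 (Risk functions). We introduce, for any $\theta \in \Theta$,

\[ r(\theta) = \overline{\mathbb{P}}(\ell_\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell_\theta(Z_i), \]

the empirical risk function, and

\[ R(\theta) = \mathbb{P}\left[ r(\theta) \right] = \frac{1}{N} \sum_{i=1}^{N} P_i(\ell_\theta), \]

the risk function.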
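
As an illustration of Definition 1.2, the empirical risk is a plain average of losses over the sample. The following minimal Python sketch assumes a hypothetical one-dimensional linear model with quadratic loss; the names `empirical_risk` and `squared_loss` and the toy sample are illustrative choices, not objects taken from the paper.

```python
def empirical_risk(loss, theta, sample):
    """r(theta) = (1/N) * sum_i loss(theta, Z_i)."""
    return sum(loss(theta, z) for z in sample) / len(sample)


def squared_loss(theta, z):
    # Hypothetical loss: z = (x, y) and the prediction is theta * x.
    x, y = z
    return (y - theta * x) ** 2


# Toy sample of N = 3 observations (illustrative values only).
sample = [(1.0, 2.0), (2.0, 4.0), (3.0, 7.0)]
risk_at_2 = empirical_risk(squared_loss, 2.0, sample)
```

The true risk $R(\theta)$ would require the unknown distributions $P_i$, which is why only $r(\theta)$ can be computed from the data.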

We now describe three classical problems in statistics that fit the general context 
described above. 

Example 1.1 (Classification). We assume that $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$
where $\mathcal{X}$ is a set of objects and $\mathcal{Y}$ a finite set of possible labels
for these objects. Consider a set of classification functions
$\{f_\theta : \mathcal{X} \to \mathcal{Y},\ \theta \in \Theta\}$ which assign a label to
each object. Let us put, for any $z = (x,y) \in \mathcal{Z}$,
$\ell_\theta(z) = \psi(f_\theta(x), y)$ where $\psi$ is some symmetric discrepancy measure.
The most usual case is the 0-1 loss function $\psi(y,y') = \mathbb{1}(y \neq y')$. If
moreover $|\mathcal{Y}| = 2$ we can decide that $\mathcal{Y} = \{-1,+1\}$ and set
$\psi(y,y') = \mathbb{1}_{\mathbb{R}_-}(yy')$. However, in many practical situations,
algorithmic considerations lead to the use of a convex upper bound of this loss function, like

\[ \psi(y,y') = (1 - yy')_+ = \max(1 - yy', 0), \quad \text{the "hinge loss"}, \]

\[ \psi(y,y') = \exp(-yy'), \quad \text{the exponential loss}, \]

\[ \psi(y,y') = (1 - yy')^2, \quad \text{the least square loss}. \]

For example, Cortes and Vapnik [10] generalized the SVM technique to non-separable
data using the hinge loss, while Schapire, Freund, Bartlett and Lee [19] gave a
statistical interpretation of boosting algorithms thanks to the exponential loss. See
Zhang [22] for a complete study of the performance of classification methods using
these loss functions. Note that in this case, $f_\theta$ is allowed to take any real value,
and not only $-1$ or $+1$, although the labels $Y_i$ in the training set are either $-1$ or
$+1$.
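
The three convex surrogates of Example 1.1 can be compared with the 0-1 loss numerically. The sketch below is a minimal Python illustration, written in terms of the margin $m = y f_\theta(x)$; the helper name `dominates_zero_one` and the grid of margins are assumptions made for the example.

```python
import math


def zero_one(margin):
    # 0-1 loss on the margin y * f(x): an error when the margin is not positive.
    return 1.0 if margin <= 0 else 0.0


def hinge(margin):
    return max(1.0 - margin, 0.0)


def exponential(margin):
    return math.exp(-margin)


def least_square(margin):
    return (1.0 - margin) ** 2


def dominates_zero_one(surrogate, margins):
    # True when the surrogate is an upper bound of the 0-1 loss on every margin.
    return all(surrogate(m) >= zero_one(m) for m in margins)


margins = [k / 10.0 for k in range(-30, 31)]
```

On this grid each surrogate stays above the 0-1 loss, which is the property that makes it a usable convex replacement.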

Example 1.2 (Regression estimation). The context is the same except that the
label set $\mathcal{Y}$ is infinite; in most cases it is $\mathbb{R}$ or an interval of
$\mathbb{R}$. Here, the most usual case is regression with quadratic loss, with
$\psi(y,y') = (y - y')^2$. However, more general cases can be studied, like the $\ell_p$
loss $\psi(y,y') = |y - y'|^p$ for some $p \geq 1$.

Example 1.3 (Density estimation). Here, we assume that $P_1 = \cdots = P_N = P$
and consequently that $\mathbb{P} = P^{\otimes N}$, and we want to estimate the density
$f = dP/d\mu$ of $P$ with respect to a known measure $\mu$. We assume that we are given a
set of probability measures $\{Q_\theta, \theta \in \Theta\}$ with densities
$q_\theta = dQ_\theta/d\mu$ and we use the loss function
$\ell_\theta(z) = -\log[q_\theta(z)]$. Indeed in this case, we can write under suitable
hypotheses

\[ R(\theta) = P(-\log \circ q_\theta)
   = P\left(-\log \circ \frac{dQ_\theta}{d\mu}\right)
   = P\left(\log \circ \frac{dP}{dQ_\theta}\right) + P\left(\log \circ \frac{d\mu}{dP}\right)
   = \mathcal{K}(P, Q_\theta) - P(\log \circ f), \]

showing that the risk is the Kullback-Leibler divergence between $P$ and $Q_\theta$ up to a
constant (the definition of $\mathcal{K}$ is recalled in this paper, see Definition 1.8).

In each case the objective is to estimate $\arg\min R$ on the basis of the observations
$Z_1, \ldots, Z_N$, presumably using in some way or another the value of the empirical risk.
Note that when the space $\Theta$ is large or complex (for example a vector
space of large dimension), $\arg\min R$ and $\arg\min r$ can be very different. This
does not happen if $\Theta$ is simple (for example a vector space of small dimension),
but such a case is less interesting, as we would have to eliminate many dimensions of
$\Theta$ before proceeding to statistical inference, with no guarantee that these
directions are irrelevant.

1.2. Statistical learning theory and PAC-Bayesian point of view. The
learning theory point of view introduced by Vapnik and Cervonenkis ([9], see Vapnik
[21] for a presentation of the main results in English) gives a setting that has proved
well suited to estimation problems in large dimension. This point of
view has received considerable interest over the past few years; see for example the
well-known books of Devroye, Györfi and Lugosi [11] and of Friedman, Hastie and
Tibshirani [12], or more recently the paper by Boucheron, Bousquet and Lugosi [5] and
the references therein, for a state of the art.

The idea of Vapnik and Cervonenkis is to introduce a structure, namely a family
of submodels $\Theta_1, \Theta_2, \ldots$ The problem of model selection then arises: we
must choose the submodel $\Theta_k$ in which the minimization of the empirical risk $r$
will lead to the smallest possible value for the real risk $R$. This choice requires an
estimate of the complexity of the submodels $\Theta_k$. An example of a complexity measure
is the so-called Vapnik-Cervonenkis dimension or VC-dimension, see [9, 21].

The PAC-Bayesian point of view, introduced in the context of classification by
McAllester [16, 17], is based on the following remark: while classical measures of
complexity (like the VC-dimension) require theoretical results on the submodels, the
introduction of a probability measure $\pi$ on the model $\Theta$ makes it possible to
measure empirically the complexity of every submodel. From a more technical point of
view, we will see later that $\pi$ allows a generalization of the so-called union bound
(see [17] for example). This point of view might be compared with Rissanen's work on MDL
(Minimum Description Length, see [18]), which makes a link between statistical inference
and information theory: $-\log \pi(\theta)$ can be seen as the length of a code for the
parameter $\theta$ (at least when $\Theta$ is finite).

The PAC-Bayesian point of view was developed in more contexts (classification,
least square regression and density estimation) by Catoni [7], and then improved
in the context of classification by Catoni [6] and Audibert [3], in the context of
least square regression by Audibert [2], and for regression with a general loss in our
PhD thesis [1]. The most recent work in the context of classification by Catoni
[8] improves the upper bound given on the risk of the PAC-Bayesian estimators,
leading to purely empirical bounds that allow model selection with no
assumption on the probability measure $\mathbb{P}$. The aim of this work is to extend these
results to the very general context of statistical inference introduced in subsection
1.1, which includes classification, regression with a general loss function and density
estimation.

Let us introduce our estimators. 

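Definition 1.3. Let us assume that we have a family of functions

\[ \psi_\theta^i : \mathcal{Z} \to \mathbb{R} \cup \{+\infty\} \]

indexed by $i$ in a finite or countable set $I$ and by $\theta \in \Theta$. For every
$i \in I$ we choose:

\[ \hat{\theta}_i \in \arg\min_{\theta \in \Theta} \overline{\mathbb{P}}\left(\psi_\theta^i\right). \]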

Example 1.4 (Empirical risk minimization and model selection). If we take $I = \{0\}$
we can choose $\psi_\theta^0(z) = \ell_\theta(z)$ and we obtain
$\overline{\mathbb{P}}(\psi_\theta^0) = r(\theta)$ and so

\[ \hat{\theta}_0 = \arg\min_{\theta \in \Theta} r(\theta), \]

the empirical risk minimizer. In the case where the dimension of $\Theta$ is large, we can
choose several submodels indexed by a finite or countable family $I$:
$(\Theta_i, i \in I)$. In order to obtain

\[ \hat{\theta}_i = \arg\min_{\theta \in \Theta_i} r(\theta) \]

we can put

\[ \psi_\theta^i(\cdot) = \begin{cases} \ell_\theta(\cdot) & \text{if } \theta \in \Theta_i, \\ +\infty & \text{otherwise.} \end{cases} \]

The problem of selecting the $\hat{\theta}_i$ with the smallest possible risk (the
so-called model selection problem) can be solved with the help of PAC-Bayesian bounds.

Note that the PAC-Bayesian bounds given by Catoni [6, 7, 8] usually apply to "randomized
estimators". More formally, let us introduce a $\sigma$-algebra $\mathcal{T}$ on $\Theta$
and a probability measure $\pi$ on the measurable space $(\Theta, \mathcal{T})$. We will
need the following definitions.

Definition 1.4. For any measurable space $(E, \mathcal{E})$, we let $\mathcal{M}_+^1(E)$
denote the set of all probability measures on the measurable space $(E, \mathcal{E})$.

Definition 1.5. In order to generalize the notion of estimator (a measurable function
$\mathcal{Z}^N \to \Theta$), we call a randomized estimator any function
$\rho : \mathcal{Z}^N \to \mathcal{M}_+^1(\Theta)$ that is a regular conditional
probability measure. For the sake of simplicity, the sample being given, we will write
$\rho$ instead of $\rho(Z_1, \ldots, Z_N)$.

PAC-Bayesian bounds for randomized estimators are usually given for their mean
risk

\[ \int_{\theta \in \Theta} R(\theta) \, d\rho(\theta), \]

whereas here we will rather focus on $R(\tilde{\theta})$, where $\tilde{\theta}$ is drawn
from $\rho$ and $\rho$ is highly concentrated around a "classical" (deterministic)
estimator $\hat{\theta}_i$.
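1.3. Truncation of the risk. In this subsection, we introduce a truncated version
of the relative risk of two parameters $\theta$ and $\theta'$.

Definition 1.6. We put, for any $\lambda \in \mathbb{R}_+^*$ and $(\theta,\theta') \in \Theta^2$,

\[ R_\lambda(\theta,\theta') = \mathbb{P}\left\{ \frac{1}{N} \sum_{i=1}^{N} \left[ (\ell_\theta - \ell_{\theta'})(Z_i) \wedge \frac{N}{\lambda} \right] \right\}. \]

Note of course that if, $\mathbb{P}$-almost surely, we have
$\ell_\theta - \ell_{\theta'} \leq N/\lambda$, then
$R_\lambda(\theta,\theta') = R(\theta) - R(\theta')$.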

In what follows, we will give empirical bounds on $R_\lambda(\theta,\theta')$ for some
$\theta$ and $\theta'$ chosen by some statistical procedure. One can wonder why we prefer
to bound this truncated version of the risk instead of $R(\theta) - R(\theta')$. The reason
is the following. In this paper, we want to give bounds that hold with no particular
assumption on the unknown data distribution $\mathbb{P}$. However, it is clear that we
cannot obtain a purely empirical bound on $R(\theta) - R(\theta')$ with no assumption on
the data distribution, as shown by the following example.

Example 1.5. Let us choose $c > 0$ and $\lambda > 0$. We assume that
$P_1 = \cdots = P_N$ and that $\Theta = \{\theta, \theta'\}$ with $\ell_{\theta'}(z) = 0$.
We put $\ell_\theta(Z) = cN$ with probability $1/N$ and $0$ otherwise. Then we have
$R(\theta') = 0$ and

\[ R(\theta) = \frac{1}{N}\, cN + \left(1 - \frac{1}{N}\right) 0 = c \]

while $r(\theta') = 0$ and, with probability at least $(1 - 1/N)^N \simeq \exp(-1)$, we also
have $r(\theta) = 0$. This means that we cannot upper bound precisely
$R(\theta) - R(\theta')$ by empirical quantities with no assumption.
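
The probability appearing in Example 1.5 is easy to check numerically. The following Python sketch, under the example's i.i.d. assumption, computes $(1 - 1/N)^N$ and estimates by simulation how often none of the $N$ observations hits the rare loss value $cN$; the function names, the number of trials and the seed are illustrative choices.

```python
import math
import random


def prob_empirical_risk_zero(n):
    # Probability that none of the n i.i.d. draws hits the rare loss value c * n.
    return (1.0 - 1.0 / n) ** n


def simulate(n, trials=20000, seed=0):
    # Monte Carlo estimate of the same probability.
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if all(rng.random() >= 1.0 / n for _ in range(n)):
            hits += 1
    return hits / trials
```

In such a sample the empirical risks of $\theta$ and $\theta'$ coincide although their true risks differ by $c$.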

So, we introduce the truncation of the risk. However, two remarks should be made.
First, in the case of a bounded loss function $\ell$, with a large enough ratio
$N/\lambda$ we have $R_\lambda(\theta,\theta') = R(\theta) - R(\theta')$.

In the general case, if we want to upper bound $R(\theta) - R(\theta')$ we can make
additional hypotheses on the data distribution, ensuring that a (known) upper bound

\[ \Delta_\lambda(\theta,\theta') \geq R(\theta) - R(\theta') - R_\lambda(\theta,\theta') \]

is available, as is done in our PhD thesis [1]. For the sake of completeness, such an upper
bound is given in the Appendix.

1.4. Main tools. In this subsection, we give two lemmas that will be useful in
order to build PAC-Bayesian theorems. First, let us recall the following definitions.
In this whole subsection, we assume that $(E, \mathcal{E})$ is an arbitrary measurable space.

Definition 1.7. For any measurable function $h : E \to \mathbb{R}$, for any measure
$m \in \mathcal{M}_+^1(E)$ we put

\[ m(h) = \sup_{B \in \mathbb{R}} \int_E \left[ h(x) \wedge B \right] m(dx). \]

Definition 1.8 (Kullback-Leibler divergence). Given a measurable space $(E, \mathcal{E})$,
we define, for any $(m,n) \in [\mathcal{M}_+^1(E)]^2$, the Kullback-Leibler divergence function

\[ \mathcal{K}(m,n) = \begin{cases} \displaystyle \int_E dm(e) \log\left[ \frac{dm}{dn}(e) \right] & \text{if } m \ll n, \\ +\infty & \text{otherwise.} \end{cases} \]

Lemma 1.1 (Legendre transform of the Kullback divergence function). For any
$n \in \mathcal{M}_+^1(E)$, for any measurable function $h : E \to \mathbb{R}$ such that
$n(\exp \circ h) < +\infty$ we have

\[ (1.1) \qquad \log n(\exp \circ h) = \sup_{m \in \mathcal{M}_+^1(E)} \left[ m(h) - \mathcal{K}(m,n) \right], \]

where by convention $\infty - \infty = -\infty$. Moreover, as soon as $h$ is upper-bounded on
the support of $n$, the supremum with respect to $m$ in the right-hand side is reached
for the Gibbs distribution $n_{\exp(h)}$ given by:

\[ \forall e \in E, \qquad \frac{dn_{\exp(h)}}{dn}(e) = \frac{\exp[h(e)]}{n(\exp \circ h)}. \]

The proof of this lemma is given at the end of the paper, in a section devoted to
proofs (subsection 5.1). We now state another lemma that will be useful
in the sequel. First, we need the following definition.
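
On a finite space, the variational formula of Lemma 1.1 can be verified directly. The Python sketch below is only an illustration on a three-point space: the measures are lists of probabilities, and the names `kl`, `gibbs`, `objective` as well as the numerical values of `n` and `h` are assumptions made for the example.

```python
import math


def kl(m, n):
    # Kullback-Leibler divergence K(m, n) on a finite space; +inf if m is not << n.
    total = 0.0
    for mi, ni in zip(m, n):
        if mi > 0:
            if ni == 0:
                return math.inf
            total += mi * math.log(mi / ni)
    return total


def gibbs(n, h):
    # Gibbs distribution n_exp(h): density exp(h) / n(exp o h) with respect to n.
    weights = [ni * math.exp(hi) for ni, hi in zip(n, h)]
    z = sum(weights)
    return [w / z for w in weights]


def objective(m, n, h):
    # m(h) - K(m, n), the quantity maximized in (1.1).
    return sum(mi * hi for mi, hi in zip(m, h)) - kl(m, n)


n = [0.5, 0.3, 0.2]
h = [1.0, -0.5, 2.0]
log_mgf = math.log(sum(ni * math.exp(hi) for ni, hi in zip(n, h)))
```

Evaluating `objective` at `gibbs(n, h)` gives `log_mgf` up to rounding, while any other probability vector gives a smaller value.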



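Definition 1.9. We put, for any $\alpha \in \mathbb{R}_+^*$,

\[ \Phi_\alpha : \left]-\infty, \tfrac{1}{\alpha}\right[ \to \mathbb{R}, \qquad t \mapsto -\frac{\log(1 - \alpha t)}{\alpha}. \]

Note that $\Phi_\alpha$ is invertible, that for any $u \in \mathbb{R}$,

\[ \Phi_\alpha^{-1}(u) = \frac{1 - \exp(-\alpha u)}{\alpha} \leq u, \]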

and that 



1. Also note that for $\alpha > 0$, $\Phi_\alpha$ is convex and that
$\Phi_\alpha(x) \geq x$. An elementary study of this function also proves that for any
$C > 0$, for any $\alpha \in \left]0, 1/(2C)\right[$ and any $p \in [0, C]$ we have:

$ a (p) <P+-^- 
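
The function of Definition 1.9 and its inverse are simple enough to be checked numerically. The following Python sketch assumes the formulas $\Phi_\alpha(t) = -\log(1-\alpha t)/\alpha$ and $\Phi_\alpha^{-1}(u) = (1-\exp(-\alpha u))/\alpha$; the function names `phi` and `phi_inv` are illustrative.

```python
import math


def phi(alpha, t):
    # Phi_alpha(t) = -log(1 - alpha * t) / alpha, defined for t < 1 / alpha.
    if alpha * t >= 1.0:
        raise ValueError("t must be smaller than 1 / alpha")
    return -math.log(1.0 - alpha * t) / alpha


def phi_inv(alpha, u):
    # Inverse map: (1 - exp(-alpha * u)) / alpha.
    return (1.0 - math.exp(-alpha * u)) / alpha
```

For instance, with `alpha = 0.3`, composing `phi_inv` with `phi` returns the starting point, `phi_inv(alpha, u)` never exceeds `u`, and `phi(alpha, t)` is never below `t`.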

We can now give the lemma. 

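Lemma 1.2. We have, for any $\lambda \in \mathbb{R}_+^*$, for any $\alpha \in \left]0,1\right]$,
for any $(\theta,\theta') \in \Theta^2$,

\[ \mathbb{P} \exp\left\{ \lambda \Phi_{\frac{\lambda}{N}}\left[ R_{\frac{\lambda}{\alpha}}(\theta,\theta') \right] - \frac{\lambda}{N} \sum_{i=1}^{N} \Phi_{\frac{\lambda}{N}}\left[ (\ell_\theta - \ell_{\theta'})(Z_i) \wedge \frac{\alpha N}{\lambda} \right] \right\} = 1. \]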


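The proof is almost trivial; we give it now in order to emphasize the role of the
truncation and of the change of variable.

Proof. For any $\lambda \in \mathbb{R}_+^*$, for any $(\theta,\theta') \in \Theta^2$,

\[ \mathbb{P} \exp\left\{ \lambda \Phi_{\frac{\lambda}{N}}\left[ R_{\frac{\lambda}{\alpha}}(\theta,\theta') \right] - \frac{\lambda}{N} \sum_{i=1}^{N} \Phi_{\frac{\lambda}{N}}\left[ (\ell_\theta - \ell_{\theta'})(Z_i) \wedge \frac{\alpha N}{\lambda} \right] \right\} \]

\[ = \mathbb{P} \exp\left\{ \sum_{i=1}^{N} \left( \log\left[ 1 - \frac{\lambda}{N} \left( (\ell_\theta - \ell_{\theta'})(Z_i) \wedge \frac{\alpha N}{\lambda} \right) \right] - \log\left[ 1 - \frac{\lambda}{N} P_i\left( (\ell_\theta - \ell_{\theta'})(Z_i) \wedge \frac{\alpha N}{\lambda} \right) \right] \right) \right\} \]

\[ = \mathbb{P} \prod_{i=1}^{N} \frac{1 - \frac{\lambda}{N} \left( (\ell_\theta - \ell_{\theta'})(Z_i) \wedge \frac{\alpha N}{\lambda} \right)}{1 - \frac{\lambda}{N} P_i\left( (\ell_\theta - \ell_{\theta'})(Z_i) \wedge \frac{\alpha N}{\lambda} \right)} \]

\[ = \prod_{i=1}^{N} \frac{P_i\left[ 1 - \frac{\lambda}{N} \left( (\ell_\theta - \ell_{\theta'})(Z_i) \wedge \frac{\alpha N}{\lambda} \right) \right]}{1 - \frac{\lambda}{N} P_i\left( (\ell_\theta - \ell_{\theta'})(Z_i) \wedge \frac{\alpha N}{\lambda} \right)} = 1. \]

□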



Note that this lemma will be used as an alternative to Hoeffding's or Bernstein's
inequalities (see [13]) in order to prove PAC inequalities.
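
The exponential moment identity behind Lemma 1.2 can be checked exactly on a small example. The Python sketch below assumes that the lemma reads $\mathbb{P}\exp\{\lambda\Phi_{\lambda/N}[R_{\lambda/\alpha}] - \frac{\lambda}{N}\sum_i \Phi_{\lambda/N}[(\ell_\theta-\ell_{\theta'})(Z_i)\wedge \alpha N/\lambda]\} = 1$ and that the observations are i.i.d., with the loss difference taking finitely many values; the function `lemma_expectation` and its arguments are hypothetical names introduced for the illustration.

```python
import itertools
import math


def phi(alpha, t):
    # Phi_alpha(t) = -log(1 - alpha * t) / alpha.
    return -math.log(1.0 - alpha * t) / alpha


def lemma_expectation(values, probs, n, lam, alpha):
    # values / probs: finite distribution of (l_theta - l_theta')(Z_i), shared by all i.
    cap = alpha * n / lam                      # truncation level alpha * N / lambda
    trunc = [min(v, cap) for v in values]      # truncated loss differences
    mean_trunc = sum(p * x for p, x in zip(probs, trunc))  # truncated relative risk
    a = lam / n
    total = 0.0
    # Exact expectation: enumerate every possible sample of size n.
    for idx in itertools.product(range(len(values)), repeat=n):
        weight = math.prod(probs[i] for i in idx)
        emp = sum(phi(a, trunc[i]) for i in idx)
        total += weight * math.exp(lam * phi(a, mean_trunc) - a * emp)
    return total
```

With `values = [-1.0, 0.5, 10.0]`, `probs = [0.3, 0.5, 0.2]`, `n = 4`, `lam = 2.0` and `alpha = 0.5`, the truncation caps the large value 10 at 1 and the returned expectation equals 1 up to rounding, whatever the size of the untruncated value.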



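1.5. A basic PAC-Bayesian Theorem. Let us integrate Lemma 1.2 with respect
to $(\theta,\theta')$ with a given probability measure $\pi \otimes \pi'$ with
$(\pi,\pi') \in [\mathcal{M}_+^1(\Theta)]^2$. Applying the Fubini-Tonelli theorem we obtain:

\[ (1.2) \qquad \mathbb{P} \left\{ \int_{(\theta,\theta') \in \Theta^2} d(\pi \otimes \pi')(\theta,\theta') \exp\left( \lambda \Phi_{\frac{\lambda}{N}}\left[ R_{\frac{\lambda}{\alpha}}(\theta,\theta') \right] - \frac{\lambda}{N} \sum_{i=1}^{N} \Phi_{\frac{\lambda}{N}}\left[ (\ell_\theta - \ell_{\theta'})(Z_i) \wedge \frac{\alpha N}{\lambda} \right] \right) \right\} = 1. \]

This implies that for any $(\rho,\rho') \in [\mathcal{M}_+^1(\Theta)]^2$,

\[ \mathbb{P} \left\{ \int_{(\theta,\theta') \in \Theta^2} d(\rho \otimes \rho')(\theta,\theta') \exp\left( \lambda \Phi_{\frac{\lambda}{N}}\left[ R_{\frac{\lambda}{\alpha}}(\theta,\theta') \right] - \frac{\lambda}{N} \sum_{i=1}^{N} \Phi_{\frac{\lambda}{N}}\left[ (\ell_\theta - \ell_{\theta'})(Z_i) \wedge \frac{\alpha N}{\lambda} \right] - \log\left[ \frac{d(\rho \otimes \rho')}{d(\pi \otimes \pi')}(\theta,\theta') \right] \right) \right\} \leq 1. \]

(This inequality becomes an equality when $\pi \ll \rho$ and $\pi' \ll \rho'$.)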

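Theorem 1.3. Let us assume that we have $(\pi,\pi') \in [\mathcal{M}_+^1(\Theta)]^2$, and
two randomized estimators $\rho$ and $\rho'$. For any $\varepsilon > 0$, for any
$(\alpha,\lambda) \in \left]0,1\right] \times \mathbb{R}_+^*$, with
$\mathbb{P}(\rho \otimes \rho')$-probability at least $1 - \varepsilon$ over the sample
$(Z_i)_{i=1,\ldots,N}$ and the parameters $(\tilde{\theta}, \tilde{\theta}')$, we have:

\[ R_{\frac{\lambda}{\alpha}}(\tilde{\theta}, \tilde{\theta}') \leq \Phi_{\frac{\lambda}{N}}^{-1} \left\{ \frac{1}{N} \sum_{i=1}^{N} \Phi_{\frac{\lambda}{N}}\left[ (\ell_{\tilde{\theta}} - \ell_{\tilde{\theta}'})(Z_i) \wedge \frac{\alpha N}{\lambda} \right] + \frac{\log \frac{d\rho}{d\pi}(\tilde{\theta}) + \log \frac{d\rho'}{d\pi'}(\tilde{\theta}') + \log \frac{1}{\varepsilon}}{\lambda} \right\}. \]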

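In order to provide an interpretation of Theorem 1.3, let us give the following
corollary in the bounded case, which is obtained using the basic properties of the
function $\Phi$ given just after Definition 1.9. In this case, the parameter $\alpha$ is
just set to 1.

Corollary 1.4. Let us assume that for any $(\theta,z) \in \Theta \times \mathcal{Z}$,
$0 \leq \ell_\theta(z) \leq C$. Let us assume that we have
$(\pi,\pi') \in [\mathcal{M}_+^1(\Theta)]^2$, and two randomized estimators $\rho$ and
$\rho'$. For any $\varepsilon > 0$, for any $\lambda \in \left]0, N/(2C)\right]$, with
$\mathbb{P}(\rho \otimes \rho')$-probability at least $1 - \varepsilon$ we have:

\[ R(\tilde{\theta}) - R(\tilde{\theta}') \leq \Phi_{\frac{\lambda}{N}}^{-1} \left\{ r(\tilde{\theta}) - r(\tilde{\theta}') + \frac{\lambda}{2N} \overline{\mathbb{P}}\left[ (\ell_{\tilde{\theta}} - \ell_{\tilde{\theta}'})^2 \right] + \frac{\log \frac{d\rho}{d\pi}(\tilde{\theta}) + \log \frac{d\rho'}{d\pi'}(\tilde{\theta}') + \log \frac{1}{\varepsilon}}{\lambda} \right\}. \]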

P. ALQUIER 



We can see that the difference of the "true" risks of the randomized estimators
$\tilde{\theta}$ and $\tilde{\theta}'$, drawn independently from $\rho$ and $\rho'$, is
upper bounded by the difference of the empirical risks, plus a variance term and a
complexity term expressed in terms of the log of the density of the randomized estimator
with respect to a given prior. So Theorem 1.3 provides an empirical way to compare the
theoretical performance of two randomized estimators, leading to applications in model
selection. This paper is devoted to improvements of Theorem 1.3 (we will see in the
sequel that this theorem does not necessarily lead to optimal estimators) and to the
effective construction of estimators using variants of Theorem 1.3.

Now, note that the choice of the randomized estimators $\rho$ and $\rho'$ is not
straightforward. The following theorem, which gives an integrated variant of Theorem 1.3,
can be useful for that purpose.

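Theorem 1.5. Let us assume that we have $(\pi,\pi') \in [\mathcal{M}_+^1(\Theta)]^2$. For
any $\varepsilon > 0$, for any $(\alpha,\lambda) \in \left]0,1\right] \times \mathbb{R}_+^*$,
with $\mathbb{P}$-probability at least $1 - \varepsilon$, for any
$(\rho,\rho') \in [\mathcal{M}_+^1(\Theta)]^2$,

\[ \int_{\Theta^2} R_{\frac{\lambda}{\alpha}}(\theta,\theta') \, d(\rho \otimes \rho')(\theta,\theta') \leq \Phi_{\frac{\lambda}{N}}^{-1} \left\{ \int_{\Theta^2} \frac{1}{N} \sum_{i=1}^{N} \Phi_{\frac{\lambda}{N}}\left[ (\ell_\theta - \ell_{\theta'})(Z_i) \wedge \frac{\alpha N}{\lambda} \right] d(\rho \otimes \rho')(\theta,\theta') + \frac{\mathcal{K}(\rho,\pi) + \mathcal{K}(\rho',\pi') + \log \frac{1}{\varepsilon}}{\lambda} \right\}. \]

The proof is given in subsection 5.2.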

1.6. Main results of the paper. In our PhD dissertation [1], a particular case
of Theorem 1.5 is given and applied to regression estimation with quadratic loss in
a bounded model of finite dimension $d$. In this particular case, it is shown that the
estimators based on the minimization of the right-hand side of Theorem 1.5 do not
achieve the optimal rate of convergence $d/N$, but only $(d \log N)/N$. A solution is
given by Catoni in [7] and consists in replacing the prior $\pi$ by the so-called
"localized prior" $\pi_{\exp(-\beta R)}$ for a given $\beta > 0$. The main problem is that
this choice leads to the presence of a non-empirical term in the right-hand side, namely
$\mathcal{K}(\rho, \pi_{\exp(-\beta R)})$.

In Section 2 we give an empirical bound for this term
$\mathcal{K}(\rho, \pi_{\exp(-\beta R)})$. We also give a heuristic that leads to this
technique of localization.

In Section 3 we show how this result, combined with Theorem 1.3, leads to the
effective construction of an estimator that can reach optimal rates of convergence.

The proofs of the theorems stated in this paper are gathered in Section 5.



2. Empirical bound for the localized complexity and localized 

PAC-Bayesian theorems 

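2.1. Mutual information between the sample and the parameter. Let us
consider Theorem 1.5 with $\rho' = \pi' = \delta_{\theta'}$ for a given parameter
$\theta'$. For the sake of simplicity, let us assume in this subsection that we are in the
bounded case ($\ell_\theta$ bounded by $C$). Theorem 1.5 ensures that, for any
$\lambda \in \left]0, N/(2C)\right[$, with $\mathbb{P}$-probability at least
$1 - \varepsilon$, for any $\rho \in \mathcal{M}_+^1(\Theta)$,

\[ \rho(R) - R(\theta') \leq \rho(r) - r(\theta') + \frac{\lambda}{2N} \overline{\mathbb{P}}\left[ \int_\Theta (\ell_\theta - \ell_{\theta'})^2 \, d\rho(\theta) \right] + \frac{\mathcal{K}(\rho,\pi) + \log \frac{1}{\varepsilon}}{\lambda}. \]

This is an incentive to choose

\[ \hat{\rho} = \arg\min_{\rho \in \mathcal{M}_+^1(\Theta)} \left\{ \rho(r) + \frac{\lambda}{2N} \overline{\mathbb{P}}\left[ \int_\Theta (\ell_\theta - \ell_{\theta'})^2 \, d\rho(\theta) \right] + \frac{\mathcal{K}(\rho,\pi)}{\lambda} \right\}. \]

However, if we choose to neglect the variance term, we may consider the following
randomized estimator:

\[ \hat{\rho} = \arg\min_{\rho \in \mathcal{M}_+^1(\Theta)} \left\{ \rho(r) + \frac{\mathcal{K}(\rho,\pi)}{\lambda} \right\}. \]

Actually, in this case, Lemma 1.1 leads to:

\[ \hat{\rho} = \pi_{\exp(-\lambda r)}. \]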
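
On a finite parameter set, the fact that the Gibbs posterior $\pi_{\exp(-\lambda r)}$ minimizes $\rho(r) + \mathcal{K}(\rho,\pi)/\lambda$ can be observed numerically. The Python sketch below is a minimal illustration under that finiteness assumption; the names `gibbs_posterior` and `penalized_risk` and the numerical values are hypothetical.

```python
import math


def gibbs_posterior(prior, emp_risk, lam):
    # Probabilities of pi_exp(-lam * r) on a finite parameter set.
    weights = [p * math.exp(-lam * r) for p, r in zip(prior, emp_risk)]
    z = sum(weights)
    return [w / z for w in weights]


def penalized_risk(rho, prior, emp_risk, lam):
    # rho(r) + K(rho, prior) / lam, the criterion without the variance term.
    mean = sum(q * r for q, r in zip(rho, emp_risk))
    kl = sum(q * math.log(q / p) for q, p in zip(rho, prior) if q > 0)
    return mean + kl / lam


prior = [0.25, 0.25, 0.25, 0.25]
emp_risk = [0.9, 0.4, 0.1, 0.5]
lam = 5.0
posterior = gibbs_posterior(prior, emp_risk, lam)
```

By Lemma 1.1 the minimal value of the criterion is $-\log \pi(\exp(-\lambda r))/\lambda$, which is what `penalized_risk(posterior, prior, emp_risk, lam)` returns up to rounding; both the prior itself and the point mass at the empirical risk minimizer give larger values.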

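Let us remark that, for any data-dependent $\rho$ and any deterministic
$\pi \in \mathcal{M}_+^1(\Theta)$, we have:

\[ (2.1) \qquad \mathbb{P}\left[ \mathcal{K}(\rho,\pi) \right] = \mathbb{P}\left[ \mathcal{K}(\rho, \mathbb{P}(\rho)) \right] + \mathcal{K}(\mathbb{P}(\rho), \pi). \]

This implies that, for a given data-dependent $\rho$, the optimal deterministic measure
$\pi$ is $\mathbb{P}(\rho)$ in the sense that it minimizes the expectation of
$\mathcal{K}(\rho,\pi)$ (left-hand side of Equation (2.1)), making it equal to the
expectation of $\mathcal{K}(\rho, \mathbb{P}(\rho))$. This last quantity
is the mutual information between the estimator and the sample.

So, for $\rho = \pi_{\exp(-\lambda r)}$, this is an incentive to replace the prior $\pi$
with $\mathbb{P}(\pi_{\exp(-\lambda r)})$. It is then natural to approximate this
distribution by $\pi_{\exp(-\lambda R)}$.

In what follows, we replace it by $\pi_{\exp(-\beta R)}$ for a given $\beta > 0$, keeping
one more degree of freedom. Now, note that Theorem 1.5 gives:

\[ \rho(R) - R(\theta') \leq \rho(r) - r(\theta') + \frac{\lambda}{2N} \overline{\mathbb{P}}\left[ \int_\Theta (\ell_\theta - \ell_{\theta'})^2 \, d\rho(\theta) \right] + \frac{\mathcal{K}(\rho, \pi_{\exp(-\beta R)}) + \log \frac{1}{\varepsilon}}{\lambda} \]

and note that the upper bound is no longer empirical (observable to the statistician).

The aim of the next subsection is to upper bound $\mathcal{K}(\rho, \pi_{\exp(-\beta R)})$
by an empirical bound in a general setting.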

2.2. Empirical bound of the localized complexity. 

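Definition 2.1. Let us put, for any
$(\alpha,\lambda) \in \left]0,1\right] \times \mathbb{R}_+^*$ and
$(\theta,\theta') \in \Theta^2$,

\[ v_{\alpha,\lambda}(\theta,\theta') = \frac{2N}{\lambda} \left\{ \frac{1}{N} \sum_{i=1}^{N} \Phi_{\frac{\lambda}{N}}\left[ (\ell_\theta - \ell_{\theta'})(Z_i) \wedge \frac{\alpha N}{\lambda} \right] - \left[ r(\theta) - r(\theta') \right] \right\}. \]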



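Theorem 2.1. Let us choose a distribution $\pi \in \mathcal{M}_+^1(\Theta)$. For any
$\varepsilon > 0$, for any
$(\alpha,\gamma,\beta) \in \left]0,1\right] \times \mathbb{R}_+^* \times \mathbb{R}_+^*$
such that $\beta < \gamma$, with $\mathbb{P}$-probability at least $1 - \varepsilon$, for
any $\rho \in \mathcal{M}_+^1(\Theta)$,

\[ \mathcal{K}(\rho, \pi_{\exp(-\beta R)}) \leq B\mathcal{K}_{\alpha,\beta,\gamma}(\rho,\pi) + \frac{\beta}{\gamma - \beta} \log \frac{1}{\varepsilon} \]

where

\[ B\mathcal{K}_{\alpha,\beta,\gamma}(\rho,\pi) = \left( 1 - \frac{\beta}{\gamma} \right)^{-1} \left\{ \mathcal{K}(\rho, \pi_{\exp(-\beta r)}) + \log \int_\Theta \pi_{\exp(-\beta r)}(d\theta') \exp\left[ \int_\Theta \rho(d\theta) \left( \frac{\beta\gamma}{2N} v_{\alpha,\gamma}(\theta,\theta') + \beta \Delta_{\frac{\gamma}{\alpha}}(\theta,\theta') \right) \right] \right\}. \]

The proof is given in the section dedicated to proofs, more precisely in subsection
5.3. Note that the localized entropy term is controlled by its empirical
counterpart together with a variance term.

Before combining this result with Theorem 1.5, we give the analogous result for
the non-integrated case, whose proof is also given in subsection 5.3.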






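Theorem 2.2. Let us choose a distribution $\pi \in \mathcal{M}_+^1(\Theta)$ and a
randomized estimator $\rho$. For any $\varepsilon > 0$ and $\eta > 0$, for any
$(\alpha,\gamma,\beta) \in \left]0,1\right] \times \mathbb{R}_+^* \times \mathbb{R}_+^*$
such that $\beta < \gamma$, with $\mathbb{P}\rho$-probability at least $1 - \varepsilon$,

\[ \log \frac{d\rho}{d\pi_{\exp(-\beta R)}}(\tilde{\theta}) \leq \mathcal{D}_{\alpha,\beta,\gamma}(\rho,\pi)(\tilde{\theta}) + \frac{\beta}{\gamma - \beta} \log \frac{1}{\varepsilon} \]

where

\[ \mathcal{D}_{\alpha,\beta,\gamma}(\rho,\pi)(\theta) = \left( 1 - \frac{\beta}{\gamma} \right)^{-1} \left\{ \log \frac{d\rho}{d\pi_{\exp(-\beta r)}}(\theta) + \log \int_\Theta \pi_{\exp(-\beta r)}(d\theta') \exp\left[ \frac{\beta\gamma}{2N} v_{\alpha,\gamma}(\theta,\theta') + \beta \Delta_{\frac{\gamma}{\alpha}}(\theta,\theta') \right] \right\}. \]

2.3. Localized PAC-Bayesian theorems.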

Definition 2.2. From now on, we will deal with model selection. We assume that
we have a family of submodels of $\Theta$: $(\Theta_i, i \in I)$ where $I$ is finite or
countable. We also choose a probability measure $\mu \in \mathcal{M}_+^1(I)$, and assume
that we have a prior distribution $\pi^i \in \mathcal{M}_+^1(\Theta_i)$ for every $i$.



We choose 



(6/ 



and apply Theorem 1.3 that we combine with Theorem 2.2 by a union bound
argument, to obtain the following result.

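Theorem 2.3. Let us assume that we have randomized estimators $(\rho_i)_{i \in I}$ such
that $\rho_i(\Theta_i) = 1$. For any $\varepsilon > 0$, for any
$(\alpha,\beta,\beta',\gamma,\gamma',\lambda) \in \left]0,1\right] \times (\mathbb{R}_+^*)^5$
such that $\beta < \gamma$ and $\beta' < \gamma'$, with
$\mathbb{P} \bigotimes_{i \in I} \rho_i$-probability at least $1 - \varepsilon$ over the
sample $(Z_n)_{n=1,\ldots,N}$ and the parameters $(\tilde{\theta}_i)_{i \in I}$, for any
$(i,i') \in I^2$ we have:

\[ R_{\frac{\lambda}{\alpha}}(\tilde{\theta}_i, \tilde{\theta}_{i'}) \leq \Phi_{\frac{\lambda}{N}}^{-1} \left\{ r(\tilde{\theta}_i) - r(\tilde{\theta}_{i'}) + \frac{\lambda}{2N} v_{\alpha,\lambda}(\tilde{\theta}_i, \tilde{\theta}_{i'}) + \frac{1}{\lambda} \left[ \mathcal{D}_{\alpha,\beta,\gamma}(\rho_i, \pi^i)(\tilde{\theta}_i) + \mathcal{D}_{\alpha,\beta',\gamma'}(\rho_{i'}, \pi^{i'})(\tilde{\theta}_{i'}) + \left( 1 + \frac{\beta}{\gamma - \beta} + \frac{\beta'}{\gamma' - \beta'} \right) \log \frac{3}{\varepsilon \mu(i) \mu(i')} \right] \right\}. \]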

In the same way, we can give an integrated variant, using Theorem 1.5 and
Theorem 2.1.

Theorem 2.4. For any $\varepsilon > 0$, for any
$(\alpha,\beta,\beta',\gamma,\gamma',\lambda) \in \left]0,1\right] \times (\mathbb{R}_+^*)^5$
such that $\beta < \gamma$ and $\beta' < \gamma'$, with $\mathbb{P}$-probability at least
$1 - \varepsilon$, for any $(i,i') \in I^2$ and
$(\rho,\rho') \in \mathcal{M}_+^1(\Theta_i) \times \mathcal{M}_+^1(\Theta_{i'})$,

\[ \int_{\Theta^2} d(\rho \otimes \rho')(\theta,\theta') R_{\frac{\lambda}{\alpha}}(\theta,\theta') \leq \Phi_{\frac{\lambda}{N}}^{-1} \left\{ \rho(r) - \rho'(r) + \frac{\lambda}{2N} \int_{\Theta^2} d(\rho \otimes \rho')(\theta,\theta') v_{\alpha,\lambda}(\theta,\theta') + \frac{1}{\lambda} \left[ B\mathcal{K}_{\alpha,\beta,\gamma}(\rho, \pi^i) + B\mathcal{K}_{\alpha,\beta',\gamma'}(\rho', \pi^{i'}) + \left( 1 + \frac{\beta}{\gamma - \beta} + \frac{\beta'}{\gamma' - \beta'} \right) \log \frac{3}{\varepsilon \mu(i) \mu(i')} \right] \right\}. \]

2.4. Choice of the parameters. In this subsection, we explain how to choose the
parameters $\lambda$, $\beta$, $\beta'$, $\gamma$ and $\gamma'$ in Theorems 2.3 and 2.4. In
some really simple situations (parametric model with strong assumptions on $\mathbb{P}$),
this choice can be made on the basis of theoretical considerations. However, in many
realistic situations, such hypotheses cannot be made and we would like to optimize the
upper bound in the theorems with respect to the parameters. This would lead to
data-dependent values for the parameters, which is not allowed by Theorems 2.3 and 2.4.
Catoni [8] proposes to make a union bound on a grid of values of the parameters, thus
allowing optimization with respect to these parameters. We apply this idea to Theorem 2.4
and obtain the following result.
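
The union bound on a grid can be sketched in a few lines. In the Python illustration below, `bound(lam, eps)` stands for any bound that is valid with probability at least $1 - \varepsilon$ for each fixed $\lambda$; spending $\varepsilon \nu(\lambda)$ on each grid point keeps the total failure probability below $\varepsilon$ and makes a data-driven choice of $\lambda$ legitimate. The function `toy_bound`, the uniform weights and the dyadic grid are assumptions made for the example, not the bound of the paper.

```python
import math


def grid_optimized_bound(bound, grid, eps):
    # bound(lam, eps) is assumed valid with probability >= 1 - eps for each FIXED lam.
    # Uniform weights nu(lam) = 1 / len(grid) are one possible choice.
    nu = 1.0 / len(grid)
    # Each grid point is evaluated at confidence level eps * nu(lam).
    return min((bound(lam, eps * nu), lam) for lam in grid)


def toy_bound(lam, eps, emp=0.2, var=0.1, n=1000):
    # Hypothetical bound: empirical term + variance term + complexity term.
    return emp + lam * var / (2 * n) + math.log(1.0 / eps) / lam


grid = [2.0 ** k for k in range(0, 11)]
best_value, best_lam = grid_optimized_bound(toy_bound, grid, eps=0.05)
```

The price of the optimization is the extra $\log(1/\nu(\lambda))$ inside the complexity term, which is only logarithmic in the size of the grid.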

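Theorem 2.5. Let us choose a measure $\nu \in \mathcal{M}_+^1(\mathbb{R}_+^*)$ that is
supported by a finite or countable set of points, $\mathrm{supp}(\nu)$. Let us assume that
we have randomized estimators $(\rho_{i,\beta})_{i \in I, \beta \in \mathrm{supp}(\nu)}$
such that $\rho_{i,\beta}(\Theta_i) = 1$. For any $\varepsilon > 0$ and
$\alpha \in \left]0,1\right]$, with
$\mathbb{P} \bigotimes_{i \in I, \beta \in \mathrm{supp}(\nu)} \rho_{i,\beta}$-probability
at least $1 - \varepsilon$ over the sample $(Z_n)_{n=1,\ldots,N}$ and the parameters
$(\tilde{\theta}_{i,\beta})_{i \in I, \beta \in \mathrm{supp}(\nu)}$, for any
$(i,i') \in I^2$ and $(\beta,\beta') \in \mathrm{supp}(\nu)^2$ we have: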

<B((i,0),(i',l3'j) = inf ^ 1 \r(e i , )-r(e il , 0l ) 
V / a e]o, +oo[ jv v ' x ' 

7 6]/3,+oo[ *. 
7' £]/3', +oo[ 



+ — V, a (6i 3,6 V B >) + - 
2N '« v ' A 



fo,/3,7(/Oi,/3,7r J )(^i,/3) + V a,/3',j' (Pi,f3'> ^ )(#i',/3') 

(3' 



log 



7-/5 i-P) ev(\)v(rf)v{p)vtf)v(P')ii(j)n(i') 

2.5. Introduction of the complexity function. It is convenient to remark that
we can dissociate the optimization with respect to the different parameters in
Theorem 2.5 thanks to the introduction of an appropriate complexity function. The
model selection algorithm we propose in the next subsection takes advantage of this
decomposition.

Definition 2.3. Let us choose some real constants $\zeta > 1$,
$\alpha \in \left]0,1\right]$ and $\varepsilon > 0$. We assume that some randomized
estimators $(\rho_{i,\beta})_{i \in I, \beta \in \mathrm{supp}(\nu)}$ have been chosen and
that we have drawn $\tilde{\theta}_{i,\beta}$ for every $i \in I$ and
$\beta \in \mathrm{supp}(\nu)$. We define, for any $i \in I$ and
$\beta \in \mathrm{supp}(\nu)$,

C ( i >P) = r j& f \ V a,P n {pi,p-,K % )(0 i: p) 
76[C/3,+°°[ I 



7-/3 c-i I "sM'M/JMi) 



We have the following result.

Theorem 2.6. For any $(i,i',\beta,\beta') \in I^2 \times \mathrm{supp}(\nu)^2$,
B({i,(3),{i',(3'j) < inf {^\r(e it0 ) -r(9 v ,p>) 

\ / A>0 jv 



A - \ . C(hf>) +C&,P>) + £l log i^XJ 



+ ^TT V a^{ ( >i,0,Vi',0' 



2N a '* v 1 A 
Note, as a consequence of the concavity of $\Phi_{\frac{\lambda}{N}}^{-1}$, that this implies

Corollary 2.7.

S ((*,/?),(*',/?')) +B((i / ,^),(i,/3) 



< 2 inf 



A>0 T7 1 2iV 2 

C( i ,/3)+C( l ',/3') + f ± ilog- 3 



A J 

Corollary 2.7 shows that the symmetric part of $B$ has an upper bound which
contains only variance and complexity terms.

3. Application: model selection 

In this section, we propose a general algorithm to select among a family of
posteriors, and so to perform model selection as a particular case. This algorithm
was introduced by Catoni [8] in the case of classification. We first give the general
form of the estimator. We then give an empirical bound on its risk. The last
subsection is devoted to a theoretical bound under suitable hypotheses.

3.1. Selection algorithm. We introduce the following definition for the sake of 
simplicity. 

Definition 3.1. Let us put:

\[ \mathcal{P} = \{t^1, \ldots, t^M\} = \{(i,\beta) \in I \times \mathrm{supp}(\nu)\}, \]

where $M = |I| \times |\mathrm{supp}(\nu)|$ and the indexation of the $t^k$ is such that

\[ \mathcal{C}(t^1) \leq \cdots \leq \mathcal{C}(t^M). \]

Now, remark that there is no reason for the bound $B$ defined in Theorem 2.5 to
be sub-additive. So let us define a sub-additive version of $B$.

Definition 3.2. We put, for any $(t,t') \in \mathcal{P}^2$:

\[ \tilde{B}(t,t') = \inf_{\substack{h \geq 1 \\ (t_0, \ldots, t_h) \in \mathcal{P}^{h+1} \\ t_0 = t,\ t_h = t'}} \sum_{k=1}^{h} B(t_{k-1}, t_k). \]

Definition 3.3. For any $k \in \{1, \ldots, M\}$ we put:

\[ s(k) = \inf\left\{ j \in \{1, \ldots, M\},\ \tilde{B}(t^k, t^j) > 0 \right\}. \]

We are now ready to give the definition of our estimator.

Definition 3.4. We take as an estimator $\tilde{\theta}_{\hat{t}}$ where
$\hat{t} = t^{\hat{k}}$ and

\[ \hat{k} = \min\left( \arg\max s \right). \]
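
The selection rule of Definitions 3.2 to 3.4 can be sketched on a finite family as follows. This Python illustration assumes that the matrix `b` already holds the values $B(t^k, t^j)$ with rows and columns sorted by increasing complexity, that `b` has no negative cycle, and that the sub-additive version of $B$ is the shortest-path closure over chains with fixed endpoints; indices are 0-based, and the value `m` stands for "no such $j$". The function names are hypothetical.

```python
def subadditive_closure(b):
    # tilde_b[s][t] = min over chains s = t0, ..., th = t (h >= 1) of the summed bounds.
    # Floyd-Warshall style relaxation; assumes b has no negative cycle.
    m = len(b)
    d = [row[:] for row in b]
    for k in range(m):
        for i in range(m):
            for j in range(m):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


def select_index(b):
    # Rows and columns of b are assumed sorted by increasing complexity C(t).
    d = subadditive_closure(b)
    m = len(d)
    s = []
    for k in range(m):
        positives = [j for j in range(m) if d[k][j] > 0]
        s.append(positives[0] if positives else m)  # m plays the role of "no such j"
    # Smallest index among the maximizers of s.
    return s.index(max(s)), s
```

A point $t^k$ scores well when no simpler point is provably better than it, that is when the first index $j$ with a positive closed bound is as large as possible; ties are broken in favor of the lowest complexity.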

3.2. Empirical bound on the risk of the selected estimator. 

Theorem 3.1. Let us put $\hat{s} = s(\hat{k})$. For any $\varepsilon > 0$, with
$\mathbb{P} \bigotimes_{t \in \mathcal{P}} \rho_t$-probability at least $1 - \varepsilon$,



:(%) 



R[6t) < R 



'0, l<i<s, 

B(t s ^\P) s<j<k, 

B(t, t § ) + B(t s , P), j S (arg max s) 

B(t,P), otherwise. 
Thus, adding only nonnegative terms to the bound,



R(9 { <R 



0. 



1 < 3 < * , 
s <j <k, 



+ < 



B(P,t s ) + B(t s ,P) 

+B(t,t s ) + B{t s J) j e (argmaxs), 
B(P,i) + B(t, P), otherwise. 



For a proof, we refer the reader to Catoni [8], where this theorem is proved in the
case of classification; the proof can be reproduced here without any modification.

Theorem 3.1 shows that, according to Corollary 2.7,
$R(\tilde{\theta}_{\hat{t}}) - R(\tilde{\theta}_{t^j})$ can be bounded by variance and
complexity terms relative to posterior distributions with a complexity not greater than
$\mathcal{C}(t^j)$, and an empirical loss in any case not much larger than the one of
$\tilde{\theta}_{t^j}$.



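3.3. Theoretical bound. In this subsection, we choose $\rho_{i,\beta}$ as
$\pi^i_{\exp(-\beta r)}$ restricted to a (random) neighborhood of $\hat{\theta}_i$. More
formally, for any $p > 0$, let us put

\[ \Theta_{i,p} = \left\{ \theta \in \Theta_i,\ r(\theta) - \inf_{\Theta_i} r \leq p \right\} \]

and for any $q \in \left]0,1\right]$ let us put

\[ p_{i,\beta}(q) = \inf\left\{ p > 0,\ \pi^i_{\exp(-\beta r)}(\Theta_{i,p}) \geq q \right\}. \]

Then let us choose $q$ once and for all and let us choose $\rho_{i,\beta}$ so that

\[ \frac{d\rho_{i,\beta}}{d\pi^i_{\exp(-\beta r)}}(\theta) = \frac{\mathbb{1}_{\Theta_{i,p_{i,\beta}(q)}}(\theta)}{\pi^i_{\exp(-\beta r)}\left( \Theta_{i,p_{i,\beta}(q)} \right)}. \]

Moreover, we assume that $0 \leq \ell_\theta(z) \leq C$ for any $\theta \in \Theta$ and
$z \in \mathcal{Z}$, and we fix $\alpha = 1$. In this case, note that for any
$\lambda \leq N/(2C)$ we have:

\[ v_{1,\lambda}(\theta,\theta') \leq \overline{\mathbb{P}}\left[ (\ell_\theta - \ell_{\theta'})^2 \right]. \]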

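For the sake of simplicity we introduce the following definition.

Definition 3.5. Let us put, for any $(\theta,\theta') \in \Theta^2$:

\[ v(\theta,\theta') = \overline{\mathbb{P}}\left[ (\ell_\theta - \ell_{\theta'})^2 \right] \]

and

\[ V(\theta,\theta') = \mathbb{P}\left[ v(\theta,\theta') \right]. \]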

To obtain the following result we take $\nu$ as the uniform measure on the grid
$\mathrm{supp}(\nu) = \left\{ 2^0, 2^1, \ldots, 2^{\left\lfloor \frac{\log N}{\log 2} \right\rfloor} \right\}$.

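Theorem 3.2. Let us put, for any $i \in I$,

\[ \overline{\theta}_i = \arg\min_{\theta \in \Theta_i} R(\theta) \]

and

\[ \overline{\theta} = \arg\min_{\theta \in \Theta} R(\theta). \]

Let us assume that Mammen and Tsybakov's margin assumption is satisfied, in
other words let there exist $(\kappa,c) \in [1,+\infty[ \times \mathbb{R}_+^*$ such that

\[ \forall \theta \in \Theta, \qquad \left[ V(\theta, \overline{\theta}) \right]^\kappa \leq c \left[ R(\theta) - R(\overline{\theta}) \right]. \]

Let moreover every sub-model $\Theta_i$, $i \in I$, satisfy the following dimension assumption:

\[ \sup_{\xi \in \mathbb{R}_+} \left\{ \xi \left[ \pi^i_{\exp(-\xi R)}(R) - R(\overline{\theta}_i) \right] \right\} \leq d_i \]

for a given sequence $(d_i)_{i \in I} \in (\mathbb{R}_+)^I$. Then there is a constant
$\mathcal{C} = \mathcal{C}(\kappa, c, C)$ such that, with
$\mathbb{P} \bigotimes_{i \in I, \beta \in \mathrm{supp}(\nu)} \rho_{i,\beta}$-probability
at least $1 - 4\varepsilon$,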



#(0 f ) < mil i?(6> t )+C max 




iei] v ' \\ N 

/ di + logI+logi^ 



V ^ / j j 

For a proof, see subsection 5.4. Let us now make some remarks.

Remark 3.1 (Choice of the parameter $q$). The best choice for $q$ is obviously $q = 1$.
In this case, our estimator is drawn randomly from the distribution

\[ \rho_{i,\beta} = \pi^i_{\exp(-\beta r)} \]

and the term $\log(1/q)$ vanishes.

However, practitioners worried about the idea of choosing an estimator at random in the
whole space can use a smaller value of $q$ ensuring that, in any model $i$ and
for any $\beta$,

\[ r(\tilde{\theta}_{i,\beta}) \leq \inf_{\Theta_i} r + p_{i,\beta}(q), \]

so $\tilde{\theta}_{i,\beta}$ is drawn in a neighborhood of the minimizer of the empirical risk.
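
On a finite submodel, the restricted Gibbs posterior discussed in Remark 3.1 can be computed explicitly. The Python sketch below is an illustration under that finiteness assumption: it finds the smallest excess empirical risk level whose sublevel set has Gibbs mass at least $q$, then renormalizes the Gibbs weights on that set. The function name `restricted_gibbs` and the numerical values used afterwards are hypothetical.

```python
import math


def restricted_gibbs(prior, emp_risk, beta, q):
    # Gibbs weights pi_exp(-beta * r) on a finite parameter set.
    weights = [p * math.exp(-beta * r) for p, r in zip(prior, emp_risk)]
    z = sum(weights)
    gibbs = [w / z for w in weights]
    r_min = min(emp_risk)
    excess = [r - r_min for r in emp_risk]
    # Smallest excess-risk level whose sublevel set has Gibbs mass at least q.
    # The default covers q = 1 when rounding keeps the cumulated mass just below 1.
    level = max(excess)
    mass = 0.0
    for i in sorted(range(len(excess)), key=lambda i: excess[i]):
        mass += gibbs[i]
        if mass >= q:
            level = excess[i]
            break
    # Restrict the Gibbs posterior to the sublevel set and renormalize.
    kept = [g if e <= level else 0.0 for g, e in zip(gibbs, excess)]
    total = sum(kept)
    return level, [k / total for k in kept]
```

With a uniform prior on four parameters, empirical risks `[0.9, 0.4, 0.1, 0.5]` and `beta = 5.0`, a small `q` concentrates the posterior on the empirical risk minimizer alone, while `q = 1.0` returns the unrestricted Gibbs posterior.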

Remark 3.2 (Margin assumption). The so-called margin assumption

\[ \left[ V(\theta, \overline{\theta}) \right]^\kappa \leq c \left[ R(\theta) - R(\overline{\theta}) \right] \]

was first introduced by Mammen and Tsybakov in the context of classification
[15, 20]. It has however been studied in the context of general regression by Lecué
in his PhD thesis [14]. The terminology comes from classification, where a similar
assumption can be described in terms of margin. In the general case however, there
is no margin involved, but rather a distance $V(\theta,\theta')^{1/2}$ on the parameter
space, which serves to describe the shape of the function $R$ in the neighborhood of its
minimum value $R(\overline{\theta})$.

Remark 3.3 (Dimension assumption). In many cases, the assumption

\[ \sup_{\xi \in \mathbb{R}_+} \left\{ \xi \left[ \pi^i_{\exp(-\xi R)}(R) - R(\overline{\theta}_i) \right] \right\} \leq d_i \]

is just equivalent to the fact that every $\Theta_i$ has a finite dimension proportional
to $d_i$.

4. Conclusion 

In this paper we studied a quite general regression problem. We proposed randomized
estimators that can be drawn in small neighborhoods of empirical minimizers. We proved
that these estimators reach the minimax rate of convergence
under Mammen and Tsybakov's margin assumption.

We would also like to point out that the techniques used here can be applied in
a more general context. In particular, Catoni [8] studied the transductive
classification setting, where for a given $k \in \mathbb{N}^*$, we observe the objects
$X_1, \ldots, X_{(k+1)N}$ and the labels $Y_1, \ldots, Y_N$, and we want to predict the
$kN$ missing labels $Y_{N+1}, \ldots, Y_{(k+1)N}$. In this context, a deviation result
equivalent to Lemma 1.2 can be proved, and from this result we can obtain a theorem
similar to Theorem 3.1. We refer the reader to our PhD thesis [1] for more details (the
transductive setting is introduced page 54 and the deviation result is Lemma 3.1 page
56).

5. Proofs 

5.1. Proof of Lemma II. 1L For the sake of completeness, we reproduce here the 
proof of Lemma 1X7X1 given in Catoni [6]. 

Proof of Lemma 1.1. Let us assume that $h$ is upper-bounded on the support of $\pi$.
Let us remark that $m$ is absolutely continuous with respect to $\pi$ if and only if it is
absolutely continuous with respect to $\pi_{\exp(h)}$. If it is the case, then

\[
\mathcal{K}(m,\pi_{\exp(h)}) = m\Bigl\{\log\Bigl(\frac{dm}{d\pi}\Bigr)-h\Bigr\}+\log\pi(\exp\circ h)
= \mathcal{K}(m,\pi)-m(h)+\log\pi(\exp\circ h).
\]

The left-hand side of this equation is nonnegative and cancels only for $m=\pi_{\exp(h)}$.
Note that it remains valid when $m$ is not absolutely continuous with respect to $\pi$
and just says in this case that $+\infty=+\infty$. We therefore obtain

\[
0=\inf_{m\in\mathcal{M}_+^1(E)}\bigl[\mathcal{K}(m,\pi)-m(h)\bigr]+\log\pi(\exp\circ h).
\]

This proves the second part of Lemma 1.1. For the first part, we do not assume any
longer that $h$ is upper bounded on the support of $\pi$. We can write

\[
\begin{aligned}
\log\pi(\exp\circ h) &= \sup_{B\in\mathbb{R}}\log\pi[\exp\circ(h\wedge B)]
= \sup_{B\in\mathbb{R}}\ \sup_{m\in\mathcal{M}_+^1(E)}\bigl[m(h\wedge B)-\mathcal{K}(m,\pi)\bigr] \\
&= \sup_{m\in\mathcal{M}_+^1(E)}\ \sup_{B\in\mathbb{R}}\bigl[m(h\wedge B)-\mathcal{K}(m,\pi)\bigr] \\
&= \sup_{m\in\mathcal{M}_+^1(E)}\Bigl\{\sup_{B\in\mathbb{R}}\bigl[m(h\wedge B)\bigr]-\mathcal{K}(m,\pi)\Bigr\}
= \sup_{m\in\mathcal{M}_+^1(E)}\bigl[m(h)-\mathcal{K}(m,\pi)\bigr].
\end{aligned}
\]

□
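The duality formula just proved can be checked numerically on a finite space, where every measure is a probability vector. The following is a minimal sketch under that assumption; the three-point space, the prior `pi` and the function `h` are arbitrary illustrative choices, and the helper names are hypothetical.

```python
import math

def kl(m, pi):
    """Kullback-Leibler divergence K(m, pi) between two finite distributions."""
    return sum(a * math.log(a / b) for a, b in zip(m, pi) if a > 0)

def log_laplace(pi, h):
    """log pi(exp o h) for a finite reference measure pi."""
    return math.log(sum(p * math.exp(v) for p, v in zip(pi, h)))

def gibbs(pi, h):
    """The measure pi_{exp(h)}: density proportional to exp(h) w.r.t. pi."""
    weights = [p * math.exp(v) for p, v in zip(pi, h)]
    total = sum(weights)
    return [w / total for w in weights]

def objective(m, pi, h):
    """The quantity m(h) - K(m, pi) whose supremum over m is log pi(exp o h)."""
    return sum(a * v for a, v in zip(m, h)) - kl(m, pi)

pi = [0.5, 0.3, 0.2]          # illustrative prior
h = [1.0, -0.5, 2.0]          # illustrative bounded function
lhs = log_laplace(pi, h)
value_at_gibbs = objective(gibbs(pi, h), pi, h)   # the supremum is attained here
value_at_other = objective([0.2, 0.2, 0.6], pi, h)
print(lhs, value_at_gibbs, value_at_other)
```

The first two printed numbers coincide, and any other choice of $m$ gives a smaller value, as the lemma states.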



5.2. Proof of Theorem 1.5.

Proof of Theorem 1.5. The beginning of this proof follows exactly the proof of
Theorem [O] (page 7) until Equation (1.2). Now, let us apply (to Equation (1.2))
Lemma 1.1 with $(E,\mathcal{E})=(\Theta^2,\mathcal{T}^{\otimes 2})$ to obtain:



Pexp< sup 



jv 



Consequently 



aN 

{t e -t e ,)( Zl )A — 



> dm(9, 6') - K,(m, n n') 



Pexp< 



sup 




N 



N 



i=i 

This ends the proof. 



(£ 9 -e ei ){Z t ) A 



aN 



Rx (6,9') 

>d(p®p')(6, 9') - K.(p, tt) - K(p', it' 



= 1. 



□ 






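5.3. Proof of Theorems 2.1 and 2.2.

Proof of Theorem 2.1. First, notice that:

\[
\mathcal{K}\bigl(\rho,\pi_{\exp(-\beta R)}\bigr)
= \beta\bigl[\rho(R)-\pi_{\exp(-\beta R)}(R)\bigr]
+ \mathcal{K}(\rho,\pi)-\mathcal{K}\bigl(\pi_{\exp(-\beta R)},\pi\bigr).
\]

Let us apply Theorem 1.5 with $\pi=\pi'=\rho'=\pi_{\exp(-\beta R)}$ to obtain with probability
at least $1-\varepsilon$, for any $\rho\in\mathcal{M}_+^1(\Theta)$,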



£ (P, 7Toxp(-/3fl)) < /3 



p(r-) -7r exp( _ /3il )(r) 



-4- 7 /" ,, <(> ntonr 0'\ 4- £ (ft ^(-/JA)) 

2iV 7 e2 )"(P® 7r ox P (-/3J?.)J(o'7 fc ' J H 



7 



AA(M')d(p®7r«xp(-/JK))(0 J 0') 



+ /C (/>, 7f) - /C (•7r exp (_ /3ii ) , 7r) . 



Replacing in the right-hand side of this inequality $\pi_{\exp(-\beta R)}$ with a supremum over
all possible distributions leads to the announced result. □
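The identity that opens the proof of Theorem 2.1, relating $\mathcal{K}(\rho,\pi_{\exp(-\beta R)})$ to $\mathcal{K}(\rho,\pi)$, is purely algebraic and can be verified on a finite parameter set. The sketch below assumes a four-point $\Theta$; the prior, the risk values, the posterior `rho` and $\beta$ are arbitrary illustrative numbers, and the helper names are hypothetical.

```python
import math

def kl(m, pi):
    """Kullback-Leibler divergence between two finite distributions."""
    return sum(a * math.log(a / b) for a, b in zip(m, pi) if a > 0)

def gibbs(pi, risk, beta):
    """Gibbs measure pi_{exp(-beta R)} on a finite parameter set."""
    weights = [p * math.exp(-beta * r) for p, r in zip(pi, risk)]
    total = sum(weights)
    return [w / total for w in weights]

def mean(m, f):
    """Integral of f with respect to the finite measure m."""
    return sum(a * v for a, v in zip(m, f))

pi = [0.25, 0.25, 0.25, 0.25]     # illustrative prior
risk = [0.10, 0.35, 0.20, 0.50]   # illustrative values of R(theta)
rho = [0.40, 0.10, 0.40, 0.10]    # illustrative posterior
beta = 3.0

post = gibbs(pi, risk, beta)
lhs = kl(rho, post)
rhs = beta * (mean(rho, risk) - mean(post, risk)) + kl(rho, pi) - kl(post, pi)
print(lhs, rhs)
```

Both sides agree up to rounding, whatever the choice of `rho`, `pi`, `risk` and `beta`.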



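Proof of Theorem 2.2. We have, for any $\theta$:

\[
\log\frac{d\rho}{d\pi_{\exp(-\beta R)}}(\theta)
= \beta\bigl[R(\theta)-\pi_{\exp(-\beta R)}(R)\bigr]
+ \log\frac{d\rho}{d\pi}(\theta)-\mathcal{K}\bigl(\pi_{\exp(-\beta R)},\pi\bigr).
\]

Let us apply Theorem 1.3 with $\pi=\pi'=\rho'=\pi_{\exp(-\beta R)}$ and a general $\rho$ to obtain
with $P\rho$-probability at least $1-\varepsilon$ over $\theta$,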



log 



dp 



dn, 



exp(-(3R) 



<0) < (3 



r{0) -Tr cxp (-f3 R ){r) 



+ ^ I v a ^(6,6')dw eM - m (e') + 



log \ + K (p, 7r exp( _^ fl) ) 



+ lo S T~ ^ ~~ ^ (^xpt-^K) > tt) • 
The end of the proof is the same as in the case of Theorem 2.1. □



+ / A*(M')dTcxp(W0') 



dp 



5.4. Proof of Theorem 3.2. We begin with a set of preliminary lemmas and
definitions.

Definition 5.1. For the sake of simplicity, we will write:

\[ r'(\theta,\theta') = r(\theta)-r(\theta') \]

and

\[ R'(\theta,\theta') = R(\theta)-R(\theta') \]

for any $(\theta,\theta')\in\Theta^2$.

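Definition 5.2. We introduce the margin function:

\[
\varphi:\mathbb{R}_+^*\to\mathbb{R},\qquad
x\mapsto\sup_{\theta\in\Theta}\bigl[V(\theta,\overline{\theta})-xR'(\theta,\overline{\theta})\bigr].
\]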



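Lemma 5.1 (Mammen and Tsybakov [15, 20]). Mammen and Tsybakov's margin
assumption:

\[
\exists(\kappa,c)\in[1,+\infty[\times\mathbb{R}_+^*,\ \forall\theta\in\Theta,\quad
V(\theta,\overline{\theta})^{\kappa}\le cR'(\theta,\overline{\theta})
\]

implies:

\[
\forall x>0,\quad \varphi(x)\le\Bigl(1-\frac{1}{\kappa}\Bigr)(\kappa cx)^{-\frac{1}{\kappa-1}}
\]

for $\kappa>1$ and $\varphi(c)\le 0$ for $\kappa=1$.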

Definition 5.3. We define the modified Bernstein function:

\[
g:\mathbb{R}\to\mathbb{R},\qquad
x\mapsto
\begin{cases}
\dfrac{2[\exp(x)-1-x]}{x^2} & \text{if } x\neq 0,\\[1ex]
1 & \text{if } x=0.
\end{cases}
\]

The function $g$ is a variant of Bernstein's function, used in Bernstein's inequality
(see Bernstein [4]). Here, we prove a variant of this inequality.
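For readers who want to experiment with the modified Bernstein function, here is a minimal sketch. The series branch for small $|x|$ is an implementation choice made here to avoid cancellation, not something stated in the paper. It evaluates $g(1)=2(e-2)\approx 1.44$ and checks, on a grid, the elementary bound $\exp(-x)\le 1-x+\frac{x^2}{2}g(b)$ for $x\in[-b,b]$, which is the key step in the proof of Lemma 5.2.

```python
import math

def g(x):
    """Modified Bernstein function g(x) = 2 (exp(x) - 1 - x) / x^2, g(0) = 1."""
    if abs(x) < 1e-4:
        # Taylor expansion 1 + x/3 + x^2/12 avoids catastrophic cancellation.
        return 1.0 + x / 3.0 + x * x / 12.0
    return 2.0 * (math.expm1(x) - x) / (x * x)

def bound_holds(b, points=2001):
    """Check exp(-x) <= 1 - x + x^2 g(b) / 2 on a regular grid of [-b, b]."""
    for k in range(points):
        x = -b + 2.0 * b * k / (points - 1)
        if math.exp(-x) > 1.0 - x + 0.5 * x * x * g(b) + 1e-12:
            return False
    return True

print("g(1) =", g(1.0))
print("bound on [-2, 2]:", bound_holds(2.0))
```

The bound holds because $g$ is nondecreasing and $\exp(-x)=1-x+\frac{x^2}{2}g(-x)$ exactly.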

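Lemma 5.2 (Variant of Bernstein's inequality). We have, for any $\lambda>0$ and any
$(\theta,\theta')\in\Theta^2$:

\[
P\exp\Bigl[\lambda R'(\theta,\theta')-\lambda r'(\theta,\theta')
-\frac{\lambda^2}{2N}\,g\Bigl(\frac{2\lambda C}{N}\Bigr)V(\theta,\theta')\Bigr]\le 1,
\tag{5.1}
\]

and the reverse inequality

\[
P\exp\Bigl[\lambda r'(\theta,\theta')-\lambda R'(\theta,\theta')
-\frac{\lambda^2}{2N}\,g\Bigl(\frac{2\lambda C}{N}\Bigr)V(\theta,\theta')\Bigr]\le 1.
\tag{5.2}
\]

We also have a similar inequality for variances:

\[
P\exp\Bigl[\frac{N}{4C^2}\,v(\theta,\theta')-\frac{N}{2C^2}\,V(\theta,\theta')\Bigr]\le 1.
\tag{5.3}
\]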

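Proof. We have:

\[
P\exp\bigl[\lambda R'(\theta,\theta')-\lambda r'(\theta,\theta')\bigr]
=\exp\Bigl\{\sum_{i=1}^{N}\log P\exp\Bigl[-\frac{\lambda}{N}(\ell_\theta-\ell_{\theta'})(Z_i)\Bigr]
+\lambda R'(\theta,\theta')\Bigr\}.
\]

Now, note that for any $b>0$, for any $x\in[-b,b]$ we have:

\[
\exp(-x)=1-x+\frac{x^2}{2}\,g(-x)\le 1-x+\frac{x^2}{2}\,g(b),
\]

so that

\[
\sum_{i=1}^{N}\log P\exp\Bigl[-\frac{\lambda}{N}(\ell_\theta-\ell_{\theta'})(Z_i)\Bigr]
\le-\lambda R'(\theta,\theta')+\frac{\lambda^2}{2N}\,g\Bigl(\frac{2C\lambda}{N}\Bigr)V(\theta,\theta').
\]

It shows that

\[
P\exp\bigl[\lambda R'(\theta,\theta')-\lambda r'(\theta,\theta')\bigr]
\le\exp\Bigl[\frac{\lambda^2}{2N}\,g\Bigl(\frac{2C\lambda}{N}\Bigr)V(\theta,\theta')\Bigr].
\]

The proof of the reverse inequality follows the same scheme. For Inequality (5.3)
note that, using the same scheme, we obtain:

\[
P\exp\Bigl\{\lambda v(\theta,\theta')-\lambda V(\theta,\theta')
-\frac{\lambda^2}{2N}\,g\Bigl(\frac{4\lambda C^2}{N}\Bigr)P\bigl[(\ell_\theta-\ell_{\theta'})^4(Z)\bigr]\Bigr\}\le 1.
\]

This implies that

\[
P\exp\Bigl[\lambda v(\theta,\theta')-\lambda V(\theta,\theta')
-\frac{2C^2\lambda^2}{N}\,g\Bigl(\frac{4\lambda C^2}{N}\Bigr)V(\theta,\theta')\Bigr]\le 1.
\]

The choice $\lambda=N/4C^2$ and the remark that $g(1)<2$ (actually $g(1)\simeq 1.4$) leads to
Inequality (5.3). □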
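Inequalities (5.1) and (5.2) can be checked exactly in a toy i.i.d. setting where the loss difference $(\ell_\theta-\ell_{\theta'})(Z)$ takes finitely many values. The sketch below is only an illustration: the three-point distribution, the sample size and the bound $C=1/2$ (so that the loss difference lies in $[-2C,2C]=[-1,1]$) are assumptions made for the example, and the function names are hypothetical. By independence, the logarithm of the left-hand side is $\lambda R'+N\log E\exp(-\lambda D/N)-\frac{\lambda^2}{2N}g(2\lambda C/N)V$, which should be nonpositive.

```python
import math

def g(x):
    """Modified Bernstein function, with a series branch near zero."""
    if abs(x) < 1e-4:
        return 1.0 + x / 3.0 + x * x / 12.0
    return 2.0 * (math.expm1(x) - x) / (x * x)

def log_lhs(values, probs, n, lam, c):
    """Logarithm of the left-hand side of (5.1) for i.i.d. loss differences D
    taking the given values with the given probabilities (|D| <= 2c)."""
    mean = sum(p * d for d, p in zip(values, probs))         # R'(theta, theta')
    second = sum(p * d * d for d, p in zip(values, probs))   # V(theta, theta')
    log_mgf = math.log(sum(p * math.exp(-lam * d / n) for d, p in zip(values, probs)))
    return lam * mean + n * log_mgf - lam ** 2 / (2 * n) * g(2 * lam * c / n) * second

values = [-1.0, 0.0, 1.0]   # illustrative support of the loss difference
probs = [0.2, 0.5, 0.3]
lams = [1.0, 10.0, 50.0, 200.0]
direct = [log_lhs(values, probs, 50, lam, 0.5) for lam in lams]
# The reverse inequality (5.2) is (5.1) applied to -D.
reverse = [log_lhs([-d for d in values], probs, 50, lam, 0.5) for lam in lams]
print(direct)
print(reverse)
```

All printed values are nonpositive, in agreement with the lemma.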
Definition 5.4. For the sake of shortness, we put:

[R(Si) (d, + log ± -flog 



Sffih q, e, k) — max< 



l+log 2 N 



N 



N 



Now let us give a brief overview of what follows. Lemma 5.3 proves that for some
$\beta$, $\tilde{\theta}_{i,\beta}$ achieves the expected rate of convergence in model $\Theta_i$:
$\delta_N(i,q,\varepsilon,\kappa)$. As we then want to use Theorem 3.1 to compare our estimator
$\hat{\theta}$ to every possible $\tilde{\theta}_{i,\beta}$, we will have to control the various
parts of the empirical bound $B(.,.)$ by theoretical terms. So we give two more
lemmas: Lemma 5.4 controls the empirical variance term $v(.,.)$ by the theoretical
variance term $V(.,.)$ while Lemma 5.5 provides a control for the empirical
complexity term $C(i,\beta)$. Given these three results we will be able to prove
Theorem 3.2. Let us start with



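Lemma 5.3. Under the assumptions of Theorem 3.2 there is a constant $C'=C'(\kappa,c,C)$
such that, with $P\bigotimes_{i\in I,\beta\in\mathrm{supp}(\nu)}\rho_{i,\beta}$-probability at least $1-\varepsilon$, for any
$i\in I$, there is a $\beta=\beta^*(i)\in\mathrm{supp}(\nu)$ such that

\[
R'\bigl(\tilde{\theta}_{i,\beta},\overline{\theta}_i\bigr)\le C'\delta_N(i,q,\varepsilon,\kappa).
\]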

Proof. We have, by Inequality (|5.ip in Lemma [531 



1 > vr 



exp(-/3fi) 



P exp 



AR'(.,0 i )-Ar'(.,0O 



( 



2AC 



2AT V N 



V{.,9i) 



> Ppi,p exp 



\R'{.,e l )-\r'{. 1 e l ) 

>?_ (2\C\ 
2N 9 \ N ) 



A 2 /2AC\ Tr , - . , dp i0 , . 

9[ — )v(.,e i )-io g —J^ (.) 



dir 



exp(-/3R) 



Thus 



\#(.,0 i )-Ar'(. ) i ) 



~ Tn 9 C-W-) V(,~9i) -log - fa (.)+log(M(0^C9)) 

So, with P0 ie/ ^esupp(iy) Pi, /3 -probability at least 1 — e/2, for any i S / and /3 6 
supp(y) , 



(5.4) XR'{9 iS ,9 i ) < Xr'(9^,9 t ) + —a 



A 2 / 2AC 



27V a V ^ 
log 



dp 



dn 



i,/8) + l0g 



exp(-/3_R) 



ep{i)v{l3)' 



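Note that, using Definition 5.2, for any $x>0$,

\[
V\bigl(\tilde{\theta}_{i,\beta},\overline{\theta}_i\bigr)
\le 2\bigl[V\bigl(\tilde{\theta}_{i,\beta},\overline{\theta}\bigr)+V\bigl(\overline{\theta},\overline{\theta}_i\bigr)\bigr]
\le 2\bigl[xR'\bigl(\tilde{\theta}_{i,\beta},\overline{\theta}\bigr)+xR'\bigl(\overline{\theta}_i,\overline{\theta}\bigr)+2\varphi(x)\bigr].
\]

Therefore Inequality (5.4) becomes: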
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' xX 2 [2CX\ 
X -— 9 { — 



\ x\ 2 

X H q 

N y 



2CX 



R'(0u9) 



N 



-log 



d-K 1 



^)-^(<xp(-^) J ^)+ 1 °g 



R 



Efj,(i)u(/3)' 



leading to 



' xX 2 (2CX\ 
X -— 9 { — 



R'(k 



2ip(x)X 2 (2XC 
+ \' g' 



N 



N 



\ xX 2 (2CX 

^ P {-pR)R'(-^) + ^g 



R'( 



dn 



exp(— /3r) 



logTr'exp [-Xr'(.,ei)] -^(Tr^.^^Tr^+log— ^ 



(/?) 



and 
(5.5) 



' xX 2 (2CX 
A - —g 



N \ N 
2ip(x)X 2 [2XC\ n 



R'(6i,l3,9i) < 



2xX 2 (2CX\ . 



N 



- logTr'cxp [-Ar'(.,0i)] - JC (< xp( _^ R) , 



R'(6 

dpi,/3 



exp(— fir) 



-log 



ep(i)v(P)' 



We can then use Inequality (5.2) (in Lemma 5.2 page 17) to obtain, with probability
at least $1-\varepsilon/2$, for any $i\in I$ and $\beta\in\mathrm{supp}(\nu)$,



(5.6) -log^exp[-Ar'(,^)] < Att, 



< Att 



exp(-/3ii) 



A 2 [2CX\ 



log 



< 



\ xX 2 [2CX\ 

A+ — n— ) 



< xp( ^)i?'(.^) + — sj^J 



2^(x)A 2 /2CA\ / 4 a , 



1 efi(i)u(0)' 

Combining Inequalities (5.5) and (5.6) we have, with probability at least $1-\varepsilon$, for
any $i$ and $\beta$:



(5.7) 



A- 



a;A 2 /2CA\ 



4xA 2 / 2CA\ . 



R' 



4ip(x)X 2 [2XC\ 



\ xX 2 (2CX 
X + —g' 



N \ N 
+ log 



dpi,8 



nl cxp(-l3R)R'(-i®i) 



exp(— /3r) 

In order to make explicit the terms in Inequality (5.7), let us recall the definition
of $\rho_{i,\beta}$ in Theorem 3.2 (page 13) and remark that



log 



dpi, 



1 



, , ^i,p) < log- 

^xp(-/3r) 1 
Let us also recall the dimension hypothesis in Theorem 3.2, implying that

\[
\pi^i_{\exp(-\beta R)}R'(\cdot,\overline{\theta}_i)\le\frac{d_i}{\beta}.
\]

Let us finally choose $\lambda=2\beta$; Inequality (5.7) becomes:



(5.8) 



„ 16x/3 2 /4C/3\ 



N 



\ N J 



N \ N J 



1 2 
log - + 2 log — . 



Finally, Lemma 5.1 together with the margin assumption in Theorem 3.2 ensures
that

\[
\varphi(x)\le\Bigl(1-\frac{1}{\kappa}\Bigr)(\kappa cx)^{-\frac{1}{\kappa-1}}
\]

if $\kappa>1$ and $\varphi(c)\le 0$ if $\kappa=1$. Let us first deal with the case $\kappa=1$. Inequality
(5.8) becomes, taking $x=c$,



(5.9) R'(ei, fi ,9i)< 



1 



4c/? / 4C/3\1 _1 J 16c/3 /4C/3\ - - 



Ac/3 ( AC [3 
1 + — 9 { — 



di 1 , 1 2 , 2 
-t + 75 log - + 77 log — 7TT-7777 > ■ 

In the right-hand side of Inequality l5.9l the numerator is optimal for /3 of the order 
of 



\ 



N (di + log i + log 



#'(0*0) 

but in order to keep the denominator away from zero, the maximal order of mag- 
nitude for (3 is N, so let us take j3 of the order of 



\ 



N (dj+hgj+hg^^j 



This choice leads to: 

(5.10) R'(6 it0 ,6i) <C"max\ 



[R(6 t ,6)] (*+]<*} + logi^* 



N 



iV 



= C"fo(i,«,e,l) 



for some $C''=C''(c,C)$. In the case where $\kappa>1$, Inequality (5.8) becomes:



(5.11) R'{6 it 0,ei)< 



1 4x/3 /4C/3\ 

2 ~ ~ 5 l^w 



16x(3 /4C/3\ , -j ^ 



N 



— 



R' 



1 - 



1\ 16/3(/ccx) _ — /4/3C\ 



iV 



I—/ 



4x(3 /4C/3\ 
- — log - + — log 



/3 



/3 °, °e l i[i)v{P) 
Now, we choose $x$ of the order of

mm\[R' (6 t ,9)] 



K-i N 
'J 



in Inequality 1|5.11|1 (the case x — \R' (6*^,0)] " minimizes the numerator while 
the fact that x = 0(N/f3) ensures that the denominator does not get too close to 

zero). Now, let us consider both cases for x, and first x — [R' (6i, 0)] " . In this 
case, let us choose (3 of the order of 



N (di + log i + log 



\ [R'(0i,9)Y 



This leads to a bound of the order of 



max< 



[R(6i,6)]" (d. + logi + log 



l+log 2 N 



N 



di + log i + log 



l+log 2 N 
Sfj,(i) 



N 



< 5 N (i,q,e,K). 



In the other case, x is of the order of N/ (3 and 



implying that 



[r' (o u e)Y 

R' (9^9) < 



> 



N 



0' 

N 



We have to choose (3 in order to optimize the numerator, in this case the optimal 
order of magnitude is 



, . 1 . l + log 2 iV 
di + log - + log f^— 



N 



and leads to a bound of the order of 



'di +logi + log 



l + log 2 N \ 2k-1 

sfj,(i) \ 



N 



< 5jv(i,S)£) «)• 



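So we have proved that, in the case $\kappa>1$, for some $C'''=C'''(\kappa,c,C)$,

\[
R'\bigl(\tilde{\theta}_{i,\beta},\overline{\theta}_i\bigr)\le C'''\delta_N(i,q,\varepsilon,\kappa).
\tag{5.12}
\]

We put:

\[
C'(\kappa,c,C)=
\begin{cases}
C''(c,C) & \text{if } \kappa=1,\\
C'''(\kappa,c,C) & \text{if } \kappa>1,
\end{cases}
\]

and remark that Inequalities (5.10) and (5.12) end the proof.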

Lemma 5.4. Under the assumptions of Theorem 1 3. 21 with P p<z sup p(,s) Pi,p~ 
probability at least X — s, for any € I 2 , for any (ft, 7, j3' , 7') S supp(^) 4 : 



□ 
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■T> x 



0',y (pi',?,**) + u+ 



+ - 37 log- 



7-/3 y-/3' D £ M (i) M (i'), 
Proof. According to Inequality (5.3) (Lemma 5.2 page 17),



P exp 



< 1. 



exp(-/3ii) w "exp(-/3'ii) 



Let us integrate in (8, 9') with respect to the distribution 7r 

and sum over all i, i', (3 and 0' to obtain, with P(^) ig/ p^ supp ^ v ) Pi, (3 -probability at 
least 1 — e/3, for any € I 2 , for any {(3,(5') G supp(v) 2 : 

v(8~i,0,8~i',0>) <2V' (o^pA',?) 



AC 2 
N 



log 



'i,/3 



log 



dp 



'i',/3' 



(fy,/?') +log- 



'exp(-^R) " exp(-/3'_R) 

To conclude the proof, there remains to combine this result with Theorem 12.21 page 
fTUl using a union bound argument. □ 



Lemma 5.5. Under the assumptions of Theorem \3.2l there is a constant K — 
K(k,c,C) such that, with P &> ieI p^ S upp{v) Pi,p -probability at least 1 — e, for any 
i S I, there is 7 £ supp(v) such that, for (3 = (3*{i), 

V Uja { Pl ^y) < C(i,f3) < KS N (i,q,e,K)/3. 

Proof. We have 
fiA7(Pi,/3,7r l ) (Oi,pj 



= 11- 



7, 
<(l 



log 



dpi, 



exp(— 0r) 

log — h log 7T l exp 
9 



(fl'i./s) + log<xp(-/3r) ex P ^ w (•» &,/s) 



£7 

27V 



log tt 1 exp [-/?/(., 0)] 



Let us now apply Lemma 15.21 and the now usual integration technique to obtain 
the following inequalities, with probability at least 1 — 4e/5: 

- log tt* exp [-/3/ (.,£)] =- sup [-PpSi.ft-Kfan*)] 



< — sup 



-PpR'(.,0) + ^ 9 ^f-)v(.,0) 



log--/C(p )7 r < ) 



< -log7r l exp -(3B!{ 



£_ (2§c\ 

2N 9 V N 



\ V{.,6) I +log 



Moreover 
log 7T l exp 



/3j_ 
2N 



< log^exp/^nA/^'O^) + |^.9 (^) 



/?74C 2 ■ /- \ 

+ N2 v^p^pi^y) [8 h p j 



A(3-/C 2 4 7 C 2 /3 



2 02 1 



N 2 N(~f - (3) 



log-, 

e 
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SO that 



1 £ Pl^C 2 
7 N 2 



Pi 



< log - + logTT^p^^j exp<| 



< log - + log7r^ xp( _ i9fl) exp I x 



V{.,6) 



N N \ N J 

4/3 7 C 2 4 7 C 2 /? 2 



iV 2 N(j-/3) 



log- 



f3 1 2 {20C\ 



-N + N 9 



+ 



2£y (20C\ 
N N 9 \N J 



\xR'( 



2 | 4/3 7 C 2 4 7 C 2 /3 2 



log-. 

e 



TV 2 N(-f-P) 
We then apply Lemma l5~3l to obtain with probability at least 1 — e/5 

I^(9i,pA)<C'8 N {i,q,€/5,K). 

Moreover we can choose 7 = 2/3, and remember that the choice j3 — j3* (i) leads to 

/3 < N, so 



(5.13) 



1 /3 2 8C* 2 



N 2 



Pl,/3,2/3(p l ,/3,7T l )(^,/3) 



< log - + log 7T* 



exp(-/3fl) 



exp|^[2 + ff (2C7)]i?'(.,^) 



+ ^ [2 + .9 (2C)] [ari?^, 6) + <p(x)] + ^f-C'5 N (i, q, e/5, «) 



12/3 2 C 2 
iV 2 



log- 



Now, let us compute: 

log 74^.^ expj ^ [2 + g (2C)} #(.,0,) 



<^[2 + .9(2C)] / Kvi-ft-VM**)]}^™* 



< jVj - - 

" [2 + 9 (2C)]} 



[2 + s(2C)] 



:r7r exp{-/3[l-^(2+ s (2C))]} jR '(-'^) 

a;di/3 2 + g(2C) 



< 



N i-f[ 2 + g(2C)} 

by the dimension assumption, and so for any x smaller than N/0, Inequality 15.131 
becomes 



(5.14) 



1 (3 2 8C 2 

2 ~ N 2 



Pi,/3,2/3(Pi„a,7r l )(^,/3) 



< 20C'S N (i, q, e/5, k) + 0{ \ log - + %■ 2 + g{2C) 



(3 q P 1 -f [2 + g(2C)) 
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N 



[2 + .g(2C)] [xR'( 



. , n 2 + 12<?(2<7) 1 5 
+ <p{x)] + j^ l0 %- 



The optimization of the right-hand side of Inequality (5.14) with respect to $x$ and
$\beta$ leads to the same discussion as for the optimization of the right-hand side of
Inequality (5.8) (page 20) in the proof of Lemma 5.3 (and a choice of $x$ satisfying
$x<N/\beta$). □

We are now able to proceed to the

proof of Theorem 3.2. With $P\bigotimes_{i\in I,\beta\in\mathrm{supp}(\nu)}\rho_{i,\beta}$-probability at least $1-4\varepsilon$ the
inequalities stated in Theorem 3.1 and in Lemmas 5.3, 5.4 and 5.5 are simultaneously
satisfied. In this case, let us choose $i\in I$, $\beta=\beta^*(i)$ and $j$ such that $t^j=(i,\beta)$.
We have:

0, 1 < j < s (case 1), 

B(t s V>,P) + B(t j ,t a V>) s<j<k (case 2), 
B(P,t s )+B(t s ,P) 

+B(t, P) + B(P,i) j e (argmaxs) (case 3), 
B(P , t) + B(t, P ) , otherwise (case 4) . 

Let us examine successively the four cases (1, 2, 4 and 3, this last case being the 
most difficult). 

Case 1: if 1 < j < s, then 

and so, by the result of Lemma [531 (page fT8l) . 

R'(e h 9i) <C'5 N (i,q,e, K ). 

Case 2: the idea in all the remaining cases (2, 4 and 3) is that we have to 
give a control of R'(6f,0^ t ^, controlled by the empirical bound B(., .), in terms 
of theoretical quantities only. In case 2, s < j < k, then for any A S supp(v), 

#'(6U),0 M) ) < B(fW^)+^,fW) 

C(t^))+C(P) + ^\og^ 



,(A) 



A 



\ , N C(P) +C(t s v>) + ^ log -±r. 

< ± V (t s ^,P) + — - — - — ; A c ' 1 SV(X) 



4C 2 A 

N 2 



C(P) + C{t s{ ^) + 



1 







7 log- 



As we have, by definition of the function s(.), the inequality C(t s V>) < C(P) 

3 



A 



and so 



4C 2 A 
N 2 



2A r 



2C{P 



(3 



7-/3 7' - P' 



7 log 



(5.15) E'(0 tsO) ,0 t3 ) < — +^'(^,0) + 



2C(^) + i±llog^ 



e „(A) 
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4C 2 A 
N 2 



2C(t j 



+ 



0' 



7-/3 i ~ 0' 



7 lo S 



en(i)/j,{i') 



Thus 



1 - 



2Xx 



R'(6 t s(j),0 t i) < 



2xR'{9 tD , Oi) + 2xR'{6i,e) + <p(x) 



A 

4C 2 A 
" TV 2 



2C(t j 



7-/3 7 



7^) log 



en{i)n{i') 



Let us apply Lemma 5.5 page 22 to upper bound $C(t^j)$, Lemma 5.3 page 18 to
upper bound $R'(\tilde{\theta}_{t^j},\overline{\theta}_i)$ and Lemma 5.1 page 16 to upper bound $\varphi(x)$. Let us put
moreover $\lambda=\gamma=\gamma'=2\beta=2\beta'$ and remember that $\beta<N$. We obtain, for any $x$
such that $x<N/\beta$,



1 - 



4/3a; 



iV 



R'( 



< 



4y3 
/V 



2xR'(6i,6) + 1- - (kcx) — 



+ + 32C 2 ) + 8C']^(i, g, e, k) + + log ^tt 2/3 

2/3 C — 1 e^(A) 

48C 2 



+ 



■log- 



N efi{i)n(i f ) ' 

Let us replace $x$ and $\beta$ by the values given in the discussion for the optimization of
the right-hand side of Inequality (5.8) (page 20) in the proof of Lemma 5.3 (and a
choice of $x$ satisfying $x<N/\beta$) to obtain the existence of a constant
$\mathcal{D}'=\mathcal{D}'(\kappa,c,C)$ such that

R'(e t , u) ,6 t i) <V'5 N (i,q,e, K ). 
We then deduce from this result and from Lemma 15.31 that 

R'(0{A) < R'{9 h e~v)+R'{dpA) < (V + C')6 N (i,q,e,n). 

Case 4: the proof follows roughly the same scheme than for case 2; if j > 
max(argmaxs), note that C(P) > C(t), therefore 

R'(9 h 9 (hP) ) <B(i,V) + B(t\i) 

<A„, MJ) + 2C " J > + ^'°^ 

- 2N v ' ' A 

\ 2C(P) + £±£ log -4n 



4C 2 A 
N 2 



2C(t j 



jylog- 



< 



2A 
~N 



xR'(9 i ,9)+xR'(9 tJ ,9) + cp{x) 



7-/3 7 '-/3' °eM(*)/*(i') 
2C(^) + ^±llog^ 



4C 2 A 
N 2 



2C(t j 



(3 



7 log- 



Thus 
1 - 



2Ax" 



R'(9 b 9 



< 



4Ax 
~N~ 



R'(9 



(*,/3), 



7-/3 7 '-/3' sfi(i)»(i') 



i?' 
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JV A 
4C 2 A 



N 2 



2C(t J 







P' 



7-/3 i~P' *ef*(iW) 



Let us apply Lemma 5.5 page 22 to upper bound $C(t^j)$, Lemma 5.3 page 18 to
upper bound $R'(\tilde{\theta}_{t^j},\overline{\theta}_i)$ and Lemma 5.1 page 16 to upper bound $\varphi(x)$. Let us put
moreover $\lambda=\gamma=\gamma'=2\beta=2\beta'$ and remember that $\beta<N$. We obtain, for any $x$
such that $x<N/\beta$,



Af3x 



R'( 



.%/3)) 



< 



4/3 

77 



2x#'( 



1 - 



(kco;) 7 



+ [K(l + 32C 2 ) + 8C']5 N (i, q, e, k) + — 



1 , c + i 



log- 



2/3 ' C - 1 "° e^(A) 

48C 2 , 
■log- 



2/3 



Choosing $x$ exactly in the same way as in the previous cases and replacing $\beta=\beta^*(i)$
with its value, we obtain the existence of $\mathcal{D}''=\mathcal{D}''(\kappa,c,C)$ such that

R'(6 h 9 (i>0) ) <V"6 N (i,q,e, K ) 

and so 

< (C' + V")5 N (i,q,e,K). 
Case 3: if j £ (argmaxs), remember that s = s(t) = s(j), so that 

(5.16) B!(e h 6 t i) < \B{t>,fW) + B(t s ^,t j )] + [B(t,t s )+B(t°,t) . 

We are going to upper bound separately B(P,t s ^) + B(t s ^\P) and B(i,t s ) + 
B(t s ,t). Let us first deal with the term B(t j ,t s ^) + B(t"V\t j ): 

\ 2C(P) + log 3 

(5.17) [B(V ,t s U)) + B{t s ^ < ^(f«,t j ) + £t/(A) 

<A v(t . 01 , <i) l 2C " i ' + ^ 1 °^ 

~~ N ' 7 A 



4C 2 A 
N 2 



2C(f) 



P 



P> 



7-/3 7'-/?' ° eix{t)ii{%>) 



7 log- 



< 



2A 
77 



xR'(8 t s U ) Jv) + 2xR\e tJ , Oi) + 2xR'(6i,6) + ip[x) 



2C(*0 + £l log ^ 



4C 2 A 
N 2 



2C(t?) 



P 



P' 



7-/3 7'"/?' ej*(W) 



7 log- 



Let us notice that 

R'(9 ts(j) ,8 t3 )<B(t^\V) 
and remember that, by definition, B(P,t s ^') > 0. This shows that 

R'{6 t >U)M < [B(t\t s ^) + B(t<i\V) . 

Once again, let us apply Lemma 5.5 to upper bound $C(t^j)$, Lemma 5.3 to upper
bound $R'(\tilde{\theta}_{t^j},\overline{\theta}_i)$ and Lemma 5.1 to upper bound $\varphi(x)$. Let us put moreover
$\lambda=\gamma=\gamma'=2\beta=2\beta'$. Inequality (5.17) becomes:



1 - 



4/3x 



N 



B(t j ,t s ^) + B(t s{ i\t j ) 



< 



40 
N 



2xR'{6 u 6) + I 1 - ^ ) {kcx)^ 

i , C + i 



+ [K(l + 32C 2 ) + 8C']5 N (i, q ,e, K ) + — + ^— log —prr20 

2(3 Q — 1 ev[X) 

48C 2 



log 



N ° £fJ,(i)fi(i') 



and therefore 



B(t j ,t s ^) + B(t s ^\t j )] < 8S N (i,q,s,K). 



There remains to upper bound 
the fact that C{t) < C(P): 

B(t, t s ) + B(t s ,t) 
2X 



B(t,t $ ) + B{t s ,t) 



We will use to that purpose 



< — 

" N 



xR'{6 h 6 fj ) + xR'(6 t sU) ,0 t i) + 2xR'{9 t} ,6,) + 2xR'(9 l , 9) + <p(x) 



2C(^) + |±ilog 



4C 2 A 
N 2 



2C(^)+fl + ^- + ^— log 

V 7-/3 7 - P 



Note that we have already proved that 

R'{e t . U )Ai)< [B(t*,f®)+B(fV\l?j\ <eS N (i,q,e,K). 
Plugging all these results into Inequality (|5.16p . we obtain, 
2 Ax 



I - — )R'(9 h 9 t] ) <£S N {i,q,e, K ) 



+ 



2A 



x£5 N (i, q, s, k) + 2xR'{9 tJ ,9i) + 2xR'(9 u 9) + <p(x) 
2C(P) + ^log^ 



4C 2 A 
N 2 



2C(t j 



(3 &_ 3 

7-/3 Y-/3> ° S sfx(i)»(i') 



As usual, let us apply Lemma 5.5 to upper bound $C(t^j)$, Lemma 5.3 to upper bound
$R'(\tilde{\theta}_{t^j},\overline{\theta}_i)$ and Lemma 5.1 to upper bound $\varphi(x)$. Let us put
$\lambda=\gamma=\gamma'=2\beta=2\beta'$, to obtain



1 



A(3x 



R 



< 



A0 
N 



2.i-n'[9 ; .H) - ( J - ij (kcx)*=t 

1 , c + i 



+ 32C 2 ) + 8C + 3£}S N (i, q,s, K ) + — + log -^tt2/3 



48C 2 3 



and therefore 

This ends the proof. 



R 



'(9 h e t i) <£'5 N {i 7 q,e,K). 



□ 
Appendix: bounding the effect of truncation

We will show here how to upper bound $R(\theta)-R(\theta')-R_\lambda(\theta,\theta')$ by some quantity
$\Delta_\lambda(\theta,\theta')$ depending on an additional hypothesis on the data distribution.

Lemma 5.6. Let us assume that we are in the i.i.d. case, where $P_1=\dots=P_N$,
and that for some constants $(b,B)\in(\mathbb{R}_+^*)^2$,

\[
\forall\theta\in\Theta,\quad P_1\bigl\{\exp\bigl[b\,|\ell_\theta(Z_1)|\bigr]\bigr\}\le B.
\]

Then, for any $(\theta,\theta')\in\Theta^2$,



R{9) - R{6') - R X (6, 0') < A x (6, 6') = — cxp 



-bN 
2A 



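Proof. From definitions,

\[
\begin{aligned}
R(\theta)-R(\theta')-R_\lambda(\theta,\theta')
&= P_1\Bigl\{\ell_\theta(Z_1)-\ell_{\theta'}(Z_1)-\bigl[\ell_\theta(Z_1)-\ell_{\theta'}(Z_1)\bigr]\wedge\frac{N}{\lambda}\Bigr\} \\
&= P_1\Bigl[\Bigl(\ell_\theta(Z_1)-\ell_{\theta'}(Z_1)-\frac{N}{\lambda}\Bigr)_+\Bigr],
\end{aligned}
\]

where $(x)_+=x\vee 0$. So we can write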
R{6) -R(6') -R x (6,6') 



' / Pi 



I 

JO 



h{Z x )-l e ,(Z x ) 



I 

Jo 



r + ao 
I Pi 



N 



N 



> t 



dt 



le{Z x )-le>{Zx)- — >t 



dt 



< 



J °° Pi jcxp h - (l e {Zi) - le>{Zi) - y - *) } 



dt 



< exp 



bt 



cxp ( -— ) dt, 



leading to the result stated in the lemma. 



□ 
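The mechanism of this proof can be illustrated on a finite distribution. For a real random variable $X$ and a threshold $T$, the truncation gap $E[X-X\wedge T]$ equals $E[(X-T)_+]$, and since $u_+\le e^{cu}/c$ for any $c>0$, it is at most $E[e^{cX}]\,e^{-cT}/c$. The sketch below checks both facts; the distribution, the threshold `T` and the exponent `c` are arbitrary illustrative choices, and this generic bound is not claimed to reproduce the exact constant of Lemma 5.6.

```python
import math

# Illustrative finite distribution for X (standing in for a loss difference).
values = [-2.0, 0.0, 1.0, 3.0, 6.0]
probs = [0.10, 0.40, 0.30, 0.15, 0.05]

def expect(f):
    """Expectation of f(X) under the finite distribution above."""
    return sum(p * f(x) for x, p in zip(values, probs))

T = 2.0   # truncation level (plays the role of N / lambda)
c = 0.5   # exponent used in the exponential moment

gap = expect(lambda x: x - min(x, T))             # effect of truncation
positive_part = expect(lambda x: max(x - T, 0.0)) # same quantity, rewritten
bound = expect(lambda x: math.exp(c * x)) * math.exp(-c * T) / c

print(gap, positive_part, bound)
```

Raising the threshold `T` makes the bound decay exponentially, which is the behaviour the lemma exploits with $T=N/\lambda$.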